Actual Causality


Joseph Y. Halpern


Downloaded from http://direct.mit.edu/books/book-pdf/2262849/book_9780262336611.pdf by guest on 04 April 2024 




Actual Causality 


Joseph Y. Halpern 


The MIT Press 
Cambridge, Massachusetts 
London, England 




©2016 Massachusetts Institute of Technology 


All rights reserved. No part of this book may be reproduced in any form by any electronic 
or mechanical means (including photocopying, recording, or information storage and 
retrieval) without permission in writing from the publisher. 


This book was set in LaTeX by Joseph Y. Halpern. Printed and bound in the United States of America.
Library of Congress Cataloging-in-Publication Data 


Names: Halpern, Joseph Y., 1953-
Title: Actual causality / Joseph Y. Halpern.
Description: Cambridge, MA : The MIT Press, [2016] | Includes bibliographical references and index.
Identifiers: LCCN 2016007205 | ISBN 9780262035026 (hardcover : alk. paper)
Subjects: LCSH: Functional analysis. | Probabilities. | Causation.
Classification: LCC QA320 .H325 2016 | DDC 122-dc23
LC record available at http://lccn.loc.gov/2016007205


10 9 8 7 6 5 4 3 2 1




For David, Sara, and Daniel (yet again). It’s hard to believe 
that you’re all old enough now to read and understand what 
I write (although I’m not sure you will want to!). The theory 
of causality presented here allows there to be more than one 
cause, and for blame to be shared. You’re all causes of my hap- 
piness (and have, at various times, been causes of other states 
of mind). I’m happy to share some blame with Mom for how 
you turned out. 


Preface


In The Graduate, Benjamin Braddock (Dustin Hoffman) is told that the future can be summed 
up in one word: “Plastics”. I still recall that in roughly 1990, Judea Pearl told me (and anyone 
else who would listen!) that the future was in causality. I credit Judea with inspiring my in- 
terest in causality. We ended up writing a paper on actual causality that forms the basis of this 
book. I was fortunate to have other coauthors who helped me on my (still ongoing) journey 
to a deeper understanding of causality and related notions, particularly Hana Chockler, Chris 
Hitchcock, and Tobi Gerstenberg. Our joint work features prominently in this book as well. 

I thank Sander Beckers for extended email discussions on causality, Robert Maxton for 
his careful proofreading, and Chris Hitchcock, David Lagnado, Jonathan Livengood, Robert 
Maxton, and Laurie Paul, for their terrific comments on earlier drafts of the book. Of course, 
I am completely responsible for any errors that remain.

My work on causality has been funded in part by the NSF, AFOSR, and ARO; an NSF 
grant for work on Causal Databases was particularly helpful. 






Chapter 1 


Introduction and Overview 


Felix qui potuit rerum cognoscere causas. (Happy is one who could understand 
the causes of things.) 


Virgil, Georgics ii.490


Causality plays a central role in the way people structure the world. People constantly seek 
causal explanations for their observations. Why is my friend depressed? Why won’t that file 
display properly on my computer? Why are the bees suddenly dying? 

Philosophers have typically distinguished two notions of causality, which they have called 
type causality (sometimes called general causality) and actual causality (sometimes called 
token causality or specific causality; I use the term “actual causality” throughout the book). 
Type causality is perhaps what scientists are most concerned with. These are general state- 
ments, such as “smoking causes lung cancer” and “printing money causes inflation”. By way 
of contrast, actual causality focuses on particular events: “the fact that David smoked like a 
chimney for 30 years caused him to get cancer last year”; “the fact that an asteroid struck the 
Yucatan Peninsula roughly 66 million years ago caused the extinction of the dinosaurs”; “the 
car’s faulty brakes caused the accident (not the pouring rain or the driver’s drunkenness)”. 

The reason that scientists are interested in type causality is that a statement of type causality 
allows them to make predictions: “if you smoke a lot, you will get (or, perhaps better, are 
likely to get) lung cancer”; “if a government prints $1 billion, they are likely to see 3%
inflation”. Because actual causality talks about specific instances, it is less useful for making 
predictions, although it still may be useful in understanding how we can prevent outcomes 
similar to that specific instance in the future. We may be interested in knowing that Robert’s 
sleeping in caused him to miss the appointment, because it suggests that getting him a good 
alarm clock would be useful. 

Actual causality is also a critical component of blame and responsibility assignment. That 
is perhaps why it arises so frequently in the law. In the law, we often know the relevant facts 
but still need to determine causality. David really did smoke and he really did die of lung 
cancer; the question is whether David’s smoking was the cause of him dying of lung cancer. 
Similarly, the car had faulty brakes, it was pouring rain, and the driver was drunk. Were the 
faulty brakes the cause of the accident? Or was it the rain or the driver’s drunkenness? Of
course, as the dinosaur-extinction example shows, questions of actual causality can also be 
of great interest to scientists. And issues of type causality are clearly relevant in the law as 
well. A jury is hardly likely to award large damages to David’s wife if they do not believe that 
smoking causes cancer. 

As these examples show, type causality is typically forward-looking and used for predic- 
tion. Actual causality, in contrast, is more often backward-looking. With actual causality, we 
know the outcome (David died of lung cancer; there was a car accident), and we retrospec- 
tively ask why it occurred. Type causality and actual causality are both of great interest and 
are clearly intertwined; however, as its title suggests, in this book, I focus almost exclusively 
on actual causality and notions related to it. 

Most of the book focuses on how to define these notions. What does it mean to say that 
the faulty brakes were the cause of the accident and not the driver’s drunkenness? What does 
it mean to say that one voter is more responsible than another for the outcome of an election? 
Defining causality is surprisingly difficult; there have been many attempts to do so, going 
back to Aristotle. The goal is to get a definition that matches our natural language usage of 
the word “causality” (and related words such as “responsibility” and “blame”) and is also
useful, so that it can be used, for example, to provide guidance to a jury when deciding a legal 
case, to help a computer scientist decide which line in a program is the cause of the program 
crashing, and to help an economist in determining whether instituting an austerity program 
caused a depression a year later. 

The modern view of causality arguably dates back to Hume. (For the exact citation, see 
the notes at the end of this chapter.) Hume wrote: 


We may define a cause to be an object followed by another, and where all the 
objects, similar to the first, are followed by objects similar to the second. Or, in 
other words, where, if the first object had not been, the second never had existed. 


Although Hume says “in other words”, he seems to be conflating two quite different notions of 
causality here. The first has been called the regularity definition. According to this definition, 
we can discern causality by considering just what actually happens, specifically, which events 
precede others. Roughly speaking, A causes B if As typically precede Bs. (Note that this is 
type causality.) 

By way of contrast, the second notion involves counterfactuals, statements counter to fact. 
It does not consider just what actually happened but also what might have happened, had 
things been different. Research in psychology has shown that such counterfactual thinking 
plays a key role in determining causality. People really do consider “what might have been” 
as well as “what was”. 

A recent experiment illustrates this nicely. Compare Figures 1.1(a) and (b). In both, ball 
A is heading toward a goal. In Figure 1.1(a), its path is blocked by a brick; in Figure 1.1(b), 
it is not blocked. Now ball B hits A. As a result, instead of going straight toward the goal, 
A banks off the wall and goes into the goal. In both figures, the same thing happens: B hits 
A, then A banks off the wall and goes into the goal. But when people are asked whether B 
hitting A caused A to go into the goal, they typically answer “yes” for Figure 1.1(a) and “no” 
for Figure 1.1(b). Thus, they are clearly taking the counterfactual (what would have happened 
had B not hit A) into account when they answer the question.

(a) Brick would have blocked ball A. (b) Brick would not have blocked ball A.
Figure 1.1: Considering counterfactuals. 


There have been many definitions of causality that involve counterfactuals. In this book, I 
consider only one of them (although it comes in three variants), which Judea Pearl and I de- 
veloped; this has been called the Halpern-Pearl definition; I call it simply the “HP definition” 
throughout this book. I focus on this definition partly because it is the one that I like best and 
am most familiar with. But it has a further advantage that is quite relevant to the purposes 
of this book: it has been extended to deal with notions such as responsibility, blame, and ex- 
planation. That said, many of the points that I make for the HP definition can be extended to 
other approaches to causality involving counterfactuals. 

Pretty much all approaches involving counterfactuals take as their starting point what has 
been called in the law the “but-for” test. A is a cause of B if, but for A, B would not have 
happened. Put another way, this says that the occurrence of A is necessary for the occurrence 
of B. Had A not occurred, B would not have occurred. When we say that A caused B, we 
invariably require (among other things) that A and B both occurred, so when we contemplate 
A not happening, we are considering a counterfactual. This kind of counterfactual reasoning 
clearly applies in the example above. In Figure 1.1(a), had ball B not hit ball A, ball A would
not have landed in the goal, so B hitting A is viewed as a cause of ball A going in the goal. In
contrast, in Figure 1.1(b), ball A would have gone in the goal even if it had not been hit by
ball B, so B hitting A is not viewed as a cause of ball A going in the goal.

But the but-for test is not always enough to determine causality. Consider the following 
story, to which I will return frequently: 


Suzy and Billy both pick up rocks and throw them at a bottle. Suzy’s rock 
gets there first, shattering the bottle. Because both throws are perfectly accu- 
rate, Billy’s would have shattered the bottle had it not been preempted by Suzy’s 
throw. 


Here the but-for test fails. Even if Suzy hadn’t thrown, the bottle would have shattered. 
Nevertheless, we want to call Suzy’s throw a cause of the bottle shattering and do not want to 
call Billy’s throw a cause. 

The HP definition deals with this problem by allowing the but-for test to be applied under 
certain contingencies. In the case of Suzy and Billy, we consider the contingency where Billy 
does not throw a rock. Clearly, if Billy does not throw, then had Suzy not thrown, the bottle
would not have shattered. 


There is an obvious problem with this solution: we can also consider the contingency 
where Suzy does not throw. Then the but-for definition would say that Billy’s throw is also 
a cause of the bottle shattering. But this is certainly not what most people would say. After 
all, it was Suzy’s rock that hit the bottle (duh ...). To capture the “duh” reaction, the defini- 
tion of causality needs another clause. Roughly speaking, it says that, under the appropriate 
contingencies, Suzy’s throw is sufficient for the bottle to shatter. By requiring sufficiency to 
hold under an appropriate set of contingencies, which takes into account what actually hap- 
pened (in particular, that Billy’s rock didn’t hit the bottle), we can declare Suzy’s throw to be 
a cause but not Billy’s throw. Making this precise in a way that does not fall prey to numerous 
counterexamples is surprisingly difficult. 
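
The failure of the plain but-for test in the rock-throwing story, and what changes once a contingency is allowed, can be sketched in a few lines of Python. This is an illustration only, not the HP definition: the names `bottle_shatters`, `but_for`, and `ACTUAL` are hypothetical, throws are encoded as 0/1 values by assumption, and the sufficiency clause described above is deliberately left out. As a result, the sketch reproduces the problem just noted: Billy's throw also passes the test under the contingency where Suzy does not throw.

```python
def bottle_shatters(suzy_throws, billy_throws):
    """Assumed toy equation: the bottle shatters if at least one rock is thrown."""
    return max(suzy_throws, billy_throws)


def but_for(outcome, actual, candidate, contingency=None):
    """Naive but-for test, optionally evaluated under a contingency.

    `actual` maps variable names to their actual 0/1 values, and
    `contingency` overrides some of them. The test passes if flipping
    `candidate` (with everything else held fixed) changes the outcome.
    """
    setting = {**actual, **(contingency or {})}
    flipped = {**setting, candidate: 1 - setting[candidate]}
    return outcome(**setting) != outcome(**flipped)


# In the actual story, both Suzy and Billy throw.
ACTUAL = {"suzy_throws": 1, "billy_throws": 1}

# Plain but-for: fails for Suzy, because Billy's rock is a backup.
suzy_plain = but_for(bottle_shatters, ACTUAL, "suzy_throws")

# But-for under the contingency that Billy does not throw: passes.
suzy_contingent = but_for(bottle_shatters, ACTUAL, "suzy_throws",
                          contingency={"billy_throws": 0})

# The symmetric contingency makes Billy's throw pass as well, which is
# exactly why a further clause is needed.
billy_contingent = but_for(bottle_shatters, ACTUAL, "billy_throws",
                           contingency={"suzy_throws": 0})
```

The symmetry between `suzy_contingent` and `billy_contingent` shows that allowing contingencies alone cannot separate the two throws; some account of what actually happened (Billy's rock did not hit the bottle) has to enter the model.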


I conclude this introduction with a brief overview of the rest of the book. In the next chap- 
ter, I carefully formalize the HP definition. Doing so requires me to introduce the notion of 
structural equations. The definition involves a number of technical details; I try to introduce 
them as gently as possible. I then show how the definition handles the rock-throwing example 
and a number of other subtle examples that have come up in the philosophy and law litera- 
tures. I also show how it deals with issues of transitivity, and how it can be extended to deal 
with situations where we have probabilities on outcomes. 


There are several examples that the basic HP definition does not handle well. It is well 
known in the psychology literature that when people evaluate causality, they take into account 
considerations of normality and typicality. As Kahneman and Miller have pointed out, “[in 
determining causality], an event is more likely to be undone by altering exceptional than 
routine aspects of the causal chain that led to it”. In Chapter 3, I present an extension of the
basic HP definition that takes normality into account and show that the extension deals with 
some of the problematic cases mentioned in Chapter 2. 


The HP definition of causality is model-relative. A can be a cause of B in one model but 
not in another. As a result, two opposing lawyers can disagree about whether A is a cause of B 
even if they are both working with the HP definition; they are simply using different models. 
This suggests that it is important to have techniques for distinguishing whether one model is 
more appropriate than another. More generally, we need techniques for deciding what makes 
a model “good”. These issues are the focus of Chapter 4. 


For an approach to causality to be psychologically plausible, humans need to be able to 
represent causal information in a reasonably compact way. For an approach to be computa- 
tionally feasible, the questions asked must be computable in some reasonable time. Issues of 
compact representations and computational questions are discussed in Chapter 5. In addition, 
I discuss an axiomatization of the language of causality. 


Causality is an all-or-nothing concept; either A is a cause of B or it is not. The HP defini- 
tion would declare each voter in an 11-0 vote a cause of the outcome; it would also call each 
of the six people who voted in favor of the outcome a cause in a 6-5 victory. But people tend
to view the degree of responsibility as being much lower in the first case than the second. In 
Chapter 6, I show how the HP definition can be extended to provide notions of responsibility 
and blame that arguably capture our intuitions in these cases. 
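
One way to see the difference between the two votes is to count how many other votes would have to change before a given voter becomes decisive. The sketch below does this by brute force. It is an illustration under stated assumptions, not the definition of responsibility given in Chapter 6: the majority rule in `winner`, the 0/1 encoding of votes, and the name `min_switches_to_be_critical` are all hypothetical choices made for the example.

```python
from itertools import combinations


def winner(votes):
    """Assumed rule: 1 wins if a strict majority voted 1, otherwise 0 wins."""
    return 1 if sum(votes) > len(votes) / 2 else 0


def min_switches_to_be_critical(votes, voter):
    """Smallest number of other voters whose votes must be flipped so that
    (a) the winner is unchanged, and (b) flipping `voter` then changes the winner.
    Returns None if no such set of switches exists.
    """
    actual_winner = winner(votes)
    others = [i for i in range(len(votes)) if i != voter]
    for k in range(len(others) + 1):
        for group in combinations(others, k):
            contingency = list(votes)
            for i in group:
                contingency[i] = 1 - contingency[i]
            if winner(contingency) != actual_winner:
                continue  # the contingency must preserve the actual outcome
            flipped = list(contingency)
            flipped[voter] = 1 - flipped[voter]
            if winner(flipped) != actual_winner:
                return k
    return None


unanimous = [1] * 11            # the 11-0 vote
close = [1] * 6 + [0] * 5       # the 6-5 vote

switches_unanimous = min_switches_to_be_critical(unanimous, 0)
switches_close = min_switches_to_be_critical(close, 0)
```

Under these assumptions a voter in the 6-5 vote is already decisive (no switches needed), whereas a voter in the 11-0 vote becomes decisive only after five other votes are switched, which matches the intuition that responsibility is more diluted in the unanimous case.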




Causality and explanation are clearly closely related. If A is a cause of B, then we think 
of A as providing an explanation as to why B happened. In Chapter 7, I extend the structural- 
model approach to causality to provide a definition of explanation. 

I conclude in Chapter 8 with some discussion of where we stand and some examples of 
applications of causality. 

I have tried to write this book in as modular a fashion as possible. In particular, all the 
chapters following Chapter 2 can be read independently of each other, in any order, except 
that Sections 4.4, 4.7, 5.2, and 6.3 depend on Section 3.2. Even in Chapter 2, the only material 
that is really critical in later chapters is Sections 2.1-2.3. I have tried to make the main text in
each chapter as readable as possible. Thus, I have deferred technical proofs to a final section 
in each chapter (which can be skipped without loss of continuity) and have avoided in some 
cases a careful discussion of the many viewpoints on some of the more subtle issues. There are 
a few proofs in the main text; these are ones that I would encourage even non-mathematical 
readers to try reading, just to get comfortable with the notation and concepts. However, they 
can also be skipped with no loss of continuity. 

Each chapter ends with a section of notes, which provides references to material discussed 
in the chapter, pointers to where more detailed discussions of the issues can be found, and, 
occasionally, more detail on some material not covered in the chapter. 

There has been a great deal of work on causality in philosophy, law, economics, and psy- 
chology. Although I have tried to cover the highlights, the bibliography is by no means ex- 
haustive. Still, I hope that there is enough material there to guide the interested reader’s 
exploration of the literature. 


Notes 


The hypothesis that the dinosaur extinction was due to a meteor striking the earth was pro- 
posed by Luis and Walter Alvarez [Alvarez et al. 1980]. While originally viewed as quite 
radical, this hypothesis is getting increasing support from the evidence, including evidence of 
a meteor strike in Chicxulub in the Yucatan Peninsula at the appropriate time. Pälike [2013]
provides a summary of research on the topic. 

The distinction between actual causality and type causality has been related to the dis- 
tinction between causes of effects—that is, the possible causes of a particular outcome—and 
effects of causes—that is, the possible effects of a given event. Some have claimed that rea- 
soning about actual causality is equivalent to reasoning about causes of effects, because both 
typically focus on individual events (e.g., the particular outcome of interest), whereas rea- 
soning about type causality is equivalent to reasoning about effects of causes, because when 
thinking of effects of causes, we are typically interested in understanding whether a certain 
type of behavior brings about a certain type of outcome. Although this may typically be the 
case, it is clearly not always the case; Livengood [2015] gives a number of examples that 
illustrate this point. Dawid [2007] argued that reasoning about effects of causes is typically 
forward-looking and reasoning about causes of effects is typically backward-looking. The 
idea that type causality is forward-looking and actual causality is backward-looking seems to 
have been first published by Hitchcock [2013], although as the work of Dawid [2007] shows,
it seems to have been in the air for a while. 

The exact relationship between actual and type causality has been a matter of some debate. 
Some have argued that two separate theories are needed (see, e.g., [Eells 1991; Good 1961a; 
Good 1961b; Sober 1984]); others (e.g., [Hitchcock 1995]) have argued for one basic notion. 
Statisticians have largely focused on type causality and effects of causes; Dawid, Faigman, 
and Fienberg [2014] discuss how statistical techniques can be used for causes of effects. My 
current feeling is that type causation arises from many instances of actual causation, so that 
actual causation is more fundamental, but nothing in this book hinges on that. 

Work on causality goes back to Aristotle [Falcon 2014]. Hume [1748] arguably started 
more modern work on causality. Robinson [1962] discusses Hume’s two definitions of causal- 
ity in more detail. Perhaps the best-known definition involving regularity (in the spirit of the 
first sentence in the earlier Hume quotation) is due to Mackie [1974], who introduced what 
has been called the INUS condition: A is a cause of B if A is an insufficient but necessary part
of a condition that is itself unnecessary but sufficient for B. The basic intuition behind this 
condition is that A should be a cause of B if A is necessary and sufficient for B to happen, 
but there are many counterexamples to this simple claim. Ball B hitting A is not necessary
for A to go into the goal, even in Figure 1.1(a); removing the brick would also work. Mackie
modified this basic intuition by taking A to be a cause of B if there exist X and Y such that
(A ∧ X) ∨ Y is necessary and sufficient for B, but neither A nor X by itself is sufficient
to entail B. If this definition is taken apart carefully, it gives us INUS:


• A is, by assumption, insufficient for B (that’s the I in INUS).


• A is a necessary part of a condition sufficient for B, namely, A ∧ X (that’s the N and the
S). Note that A is necessary in A ∧ X because, by assumption, X by itself is not
sufficient.


• A ∧ X is unnecessary for B (that’s the U) because there is another cause of B, namely,
Y.
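
The decomposition above can be checked mechanically on a truth table. The sketch below assumes that A, X, and Y are independent Boolean propositions and that B holds exactly when (A ∧ X) ∨ Y holds; the function name `b` and the four flag names are hypothetical, introduced only for this illustration.

```python
from itertools import product


def b(a, x, y):
    """Assumed definition of the effect: B holds exactly when (A and X) or Y."""
    return (a and x) or y


# All eight truth assignments to (A, X, Y).
assignments = list(product([False, True], repeat=3))

# I: A alone is insufficient (A can hold while B fails).
a_insufficient = any(a and not b(a, x, y) for a, x, y in assignments)

# S: the condition A-and-X is sufficient (whenever it holds, B holds).
ax_sufficient = all(b(a, x, y) for a, x, y in assignments if a and x)

# N: A is a necessary part of that condition, since X alone is insufficient.
x_alone_insufficient = any(x and not b(a, x, y) for a, x, y in assignments)

# U: the condition A-and-X is unnecessary (B can hold without it, via Y).
ax_unnecessary = any(b(a, x, y) and not (a and x) for a, x, y in assignments)
```

All four flags evaluate to true, so A is an INUS condition for B in this toy setting. The check is purely truth-functional; it does not involve the counterfactual reading of necessity and sufficiency.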


Note that the notion of necessity and sufficiency used here is related to, but different from, that 
used when considering counterfactual theories of causality. While Mackie’s INUS condition 
is no longer viewed as an appropriate definition of causality (e.g., it has trouble with cases 
of preemption such as the rock-throwing example, where there is a backup ready to bring 
about the outcome in case what we would like to call the actual cause fails to do so), work 
on defining causality in terms of regularity still continues; Baumgartner [2013] and Strevens 
[2009] are recent exemplars. 

I make no attempt to do justice to all the alternative approaches to defining causality; my 
focus is just the HP definition, which was introduced in [Halpern and Pearl 2001; Halpern and 
Pearl 2005a]. As I said, the HP definition is formalized carefully in Chapter 2. There is some 
discussion of alternative approaches to defining causality in the notes to Chapter 2. Paul and 
Hall [2013] provide a good overview of work on causality, along with a critical analysis of 
the strengths and weaknesses of various approaches. 

The experiment with ball B hitting ball A is discussed by Gerstenberg, Goodman, 
Lagnado, and Tenenbaum [2014]. Figure 1.1 is taken from their paper (which also reports 
on a number of other related experiments). 

The rock-throwing example is due to Lewis [2000]. It is but one of many examples of
preemption in both the law and philosophy literatures. 

Hart and Honoré [1985] discuss the but-for condition and, more generally, the use of actual 
causality in the law; it is the classic reference on the topic. Honoré [2010] and Moore [2009] 
provide more recent discussions. Lewis [1973a] initiated modern work in philosophy on 
counterfactual theories of causation; Menzies [2014] provides an overview of the work on 
the counterfactual approach to causation in philosophy. Sloman [2009] discusses how people 
ascribe causality, taking counterfactuals into account. 

Although the focus in this book is on work on causality in philosophy, law, and psychology, 
causality also has a long tradition in other fields. The seminal work of Adam Smith [1994], 
considered a foundation of economics, is titled “An Inquiry into the Nature and Causes of 
the Wealth of Nations”. Hoover [2008] provides an overview of work on causality in eco- 
nomics. Statistics is, of course, very concerned with causal influence; Pearl [2009] provides 
an overview of work in statistics on causation. Discussion of causality in physics comes 
up frequently, particularly in the context of the Einstein-Podolsky-Rosen paradox, where the 
question is whether A can be a cause of B even though B happened so shortly after A that a 
signal traveling at the speed of light from A could not have reached B (see [Bell 1964]). More 
recently, (the HP definition of) causality has been applied in computer science; I discuss some 
of these applications in Chapter 8. The volume edited by Ilari, Russo, and Williamson [2011] 
provides a broad overview of the role of causality in the social sciences, health sciences, and 
physical sciences. 


Chapter 2


The HP Definition of Causality 


The fluttering of a butterfly’s wing in Rio de Janeiro, amplified by atmospheric 
currents, could cause a tornado in Texas two weeks later. 


Edward Lorenz 


There is only one constant, one universal. It is the only real truth. Causality. 
Action, reaction. Cause and effect. 


Merovingian, The Matrix Reloaded 


In this chapter, I go through the HP definition in detail. The HP definition is a formal, math- 
ematical definition. Although this does add some initial overhead, it has an important advan- 
tage: it prevents ambiguity about whether A counts as a cause of B. There is no need, as in 
many other definitions, to try to understand how to interpret the words. For example, recall the 
INUS condition from the notes in Chapter 1. For A to be a cause of B under this definition, 
A has to be an insufficient but necessary part of a condition that is itself unnecessary but sufficient for B.
But what is a “condition”? The formalization of INUS suggests that it is a formula or set of 
formulas. Is there any constraint on this set? What language is it expressed in? 

This lack of ambiguity is obviously critical if we want to apply causal reasoning in the 
law. But as we shall see in Chapter 8, it is equally important in other applications of causality, 
such as program verification, auditing, and database queries. However, even if there is no 
ambiguity about the definition, it does not follow that there can be no disagreement about 
whether A is a cause of B. 

To understand how this can be the case, it is best to outline the general approach. The 
first step in the HP definition involves building a formal model in which causality can be 
determined unambiguously. Among other things, the model determines the language that is 
used to describe the world. We then define only what it means for A to be a cause of B in 
model M. It is possible to construct two closely related models M_1 and M_2 such that A is a
cause of B in M_1 but not in M_2. I do not believe that there is, in general, a “right” model;
in any case, the definition is silent on what makes one model better than another. (This is an 
important issue, however. I do think that there are criteria that can help judge whether one 
model is better than another; see below and Chapter 4 for more on this point.) Here we already
see one instance where, even if there is agreement regarding the definition of causality, we can 
get disagreements regarding causality: there may be disagreement about which model better 
describes the real world. This is arguably a feature of the definition. It moves the question 
of actual causality to the right arena—debating which of two (or more) models of the world 
is a better representation of those aspects of the world that one wishes to capture and reason 
about. This, indeed, is the type of debate that goes on in informal (and legal) arguments all 
the time. 


2.1 Causal Models 


The model assumes that the world is described in terms of variables; these variables can take 
on various values. For example, if we are trying to determine whether a forest fire was caused 
by lightning or an arsonist, we can take the world to be described by three variables: 


• FF for forest fire, where FF = 1 if there is a forest fire and FF = 0 otherwise;


• L for lightning, where L = 1 if lightning occurred and L = 0 otherwise;


• MD for match dropped (by arsonist), where MD = 1 if the arsonist dropped a lit match
and MD = 0 otherwise.


If we are considering a voting scenario where there are eleven voters voting for either Billy 
or Suzy, we can describe the world using twelve variables, V_1, ..., V_11, W, where V_i = 0 if
voter i voted for Billy and V_i = 1 if voter i voted for Suzy, for i = 1, ..., 11, W = 0 if Billy
wins, and W = 1 if Suzy wins.

In these two examples, all the variables are binary, that is, they take on only two values. 
There is no problem allowing a variable to have more than two possible values. For example, 
the variable V_i could be either 0, 1, or 2, where V_i = 2 if i does not vote; similarly, we could
take W = 2 if the vote is tied, so neither Billy nor Suzy wins. 
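
As a rough illustration of non-binary variables, the sketch below encodes the three-valued version of the voting scenario. The text fixes only the meaning of the values; the plurality rule used to compute W from the votes, and the names `VOTE_VALUES` and `outcome`, are assumptions made for the example.

```python
# Possible values of each V_i, following the encoding in the text.
VOTE_VALUES = {0: "Billy", 1: "Suzy", 2: "did not vote"}


def outcome(votes):
    """Value of W for a list of votes, under an assumed plurality rule:
    0 if Billy has more votes, 1 if Suzy has more, 2 if the vote is tied.
    """
    if any(v not in VOTE_VALUES for v in votes):
        raise ValueError("each vote must be 0, 1, or 2")
    billy = votes.count(0)
    suzy = votes.count(1)
    if suzy > billy:
        return 1
    if billy > suzy:
        return 0
    return 2


suzy_wins = outcome([1] * 6 + [0] * 5)          # 6-5 for Suzy
billy_wins = outcome([0] * 6 + [1] * 5)         # 6-5 for Billy
tied = outcome([1] * 5 + [0] * 5 + [2])         # 5-5 with one abstention
```

Enlarging the range of V_i and W changes what the model can express: a tie is simply not describable in the binary version.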

The choice of variables determines the language used to frame the situation. Although 
there is no “right” choice, clearly some choices are more appropriate than others. For example, 
if there is no variable corresponding to smoking in model M, then in M, smoking cannot be
a cause of Sam’s lung cancer. Thus, if we want to consider smoking as a potential cause of 
lung cancer, M is an inappropriate model. (As an aside, the reader may note that here and 
elsewhere I put “right” in quotes; that is because it is not clear to me that the notion of being 
“right” is even well defined.) 

Some variables may have a causal influence on others. This influence is modeled by a set
of structural equations. For example, if we want to model the fact that if the arsonist drops
a match or lightning strikes then a fire starts, we could use the variables MD, FF, and L as
above, with the equation FF = max(L, MD); that is, the value of the variable FF is the
maximum of the values of the variables MD and L. This equation says, among other things,
that if MD = 0 and L = 1, then FF = 1. The equality sign in this equation should be thought
of more like an assignment statement in programming languages; once we set the values of
MD and L, the value of FF is set to their maximum. However, despite the equality, a forest
fire starting some other way does not force the value of either MD or L to be 1.

Alternatively, if we want to model the fact that a fire requires both a lightning strike and
a dropped match (perhaps the wood is so wet that it needs two sources of fire to get going),
then the only change in the model is that the equation for FF becomes FF = min(L, MD);
the value of FF is the minimum of the values of MD and L. The only way that FF = 1 is if
both L = 1 and MD = 1.
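
The reading of a structural equation as an assignment can be sketched in Python. This is a minimal sketch rather than the formal machinery of the chapter: the names `EQUATIONS` and `solve`, the labels "disjunctive" and "conjunctive" as dictionary keys, and the `set_ff` parameter are hypothetical.

```python
# The two candidate equations for FF discussed in the text.
EQUATIONS = {
    "disjunctive": lambda l, md: max(l, md),   # FF = max(L, MD)
    "conjunctive": lambda l, md: min(l, md),   # FF = min(L, MD)
}


def solve(model, l, md, set_ff=None):
    """Return the values of L, MD, and FF.

    FF is normally computed from L and MD by the chosen equation. If
    `set_ff` is given, FF is set directly (a fire started some other way);
    because the equation works like an assignment, L and MD keep their values.
    """
    ff = EQUATIONS[model](l, md) if set_ff is None else set_ff
    return {"L": l, "MD": md, "FF": ff}


lightning_only_disj = solve("disjunctive", l=1, md=0)
lightning_only_conj = solve("conjunctive", l=1, md=0)
fire_other_way = solve("disjunctive", l=0, md=0, set_ff=1)
```

With lightning but no match, the disjunctive equation yields a fire and the conjunctive one does not. In `fire_other_way`, FF is 1 while L and MD both remain 0, which is the asymmetry that distinguishes an assignment from an ordinary algebraic equality.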


Just a notational aside before going on: I sometimes identify binary variables with primitive
propositions in propositional logic. And, as in propositional logic, the symbols ∧, ∨, and ¬ are
used to denote conjunction, disjunction, and negation, respectively. With that identification,
instead of writing max(L, MD), I write L ∨ MD; instead of writing min(L, MD), I write
L ∧ MD; and instead of writing 1 - L or 1 - MD, I write ¬L or ¬MD. Most people seem to
find the logic notation easier to absorb. I hope that the intention will be clear from context.
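
That the two notations agree on binary variables can be confirmed by checking all four combinations of values. The short check below is only a sanity test; the function name `notation_agrees` is hypothetical.

```python
from itertools import product


def notation_agrees():
    """Check that max, min, and 1 - x coincide with or, and, and not
    for every combination of binary values of L and MD."""
    return all(
        max(l, md) == int(bool(l) or bool(md))
        and min(l, md) == int(bool(l) and bool(md))
        and 1 - l == int(not bool(l))
        for l, md in product([0, 1], repeat=2)
    )


agrees = notation_agrees()
```

The identification breaks down once a variable can take a value other than 0 or 1, which is why it is restricted to binary variables.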


Going on with the forest-fire example, it is clear that both of these models are somewhat 
simplistic. Lightning does not always result in a fire, nor does dropping a lit match. One way 
of dealing with this would be to make the assignment statements probabilistic. For example, 
we could say that the probability that FF = 1 conditional on L = 1 is 0.8. I discuss this
approach in more detail in Section 2.5. It is much simpler to think of all the equations as 
being deterministic and then use enough variables to capture all the conditions that determine 
whether there is a forest fire. One way to do this is simply to add those variables explicitly. 
For example, we could add variables that talk about the dryness of the wood, the amount of 
undergrowth, the presence of sufficient oxygen (fires do not start so easily on the top of high 
mountains), and so on. If a modeler does not want to add all these variables explicitly (the 
details may simply not be relevant to the analysis), then another alternative is to use a single 
variable, say U, which intuitively incorporates all the relevant factors, without describing 
them explicitly. The value of U would determine whether the lightning strikes and whether 
the match is dropped by the arsonist. 

The value of U could also determine whether both the match and lightning are needed 
to start a fire or just one of them suffices. For simplicity, rather than using U in this way, 
I consider two causal models. In one, called the conjunctive model, both the match and 
lightning are needed to start the forest fire; in the other, called the disjunctive model, only 
one is needed. In each of these models, U determines only whether the lightning strikes
and whether the match is dropped. Thus, I assume that U can take on four possible values
of the form (i, j), where i and j are each either 0 or 1. Intuitively, i describes whether the
external conditions are such that the lightning strikes (and encapsulates all such conditions,
e.g., humidity and temperature), and j describes whether the arsonist drops the match (and
thus encapsulates all the psychological conditions that determine this). For future reference,
let U_1 and U_2 denote the components of the value of U in this example, so that if U = (i, j),
then U_1 = i and U_2 = j.

Here I have assumed a single variable U that determines the values of both L and MD. I 
could have split U into two variables, say U_L and U_MD, where U_L determines the value of L
and U_MD determines the value of MD. It turns out that whether we use a single variable with
four possible values or two variables, each with two values, makes no real difference. (Other 
modeling choices can make a big difference—I return to this point below.) 

It is reasonable to ask why I have chosen to describe the world in terms of variables and 
their values, related via structural equations. Using variables and their values is quite standard
in fields such as statistics and econometrics for good reason; it is a natural way to describe
situations. The examples later in this section should help make that point. It is also quite close 
to propositional logic; we can think of a primitive proposition in classical logic as a binary 
variable, whose values are either true or false. Although there may well be other reasonable 
ways of representing the world, this seems like a useful approach. 

As for the use of structural equations, recall that my goal is to give a definition of actual 
causality in terms of counterfactuals. Certainly if the but-for test applies, so that, but for A, B 
would not have happened, I want the definition to declare that A is a cause of B. That means 
I need a model rich enough to capture the fact that, but for A, B would not have happened. 
Because I model the world in terms of variables and their values, the A and the B in the 
definition will be statements such as X = x and Y = y. Thus, I want to say that if X had 
taken on some value x’ different from x, then Y would not have had value y. To do this, I 
need a model that makes it possible to consider the effect of intervening on X and changing its 
value from x to x’. Describing the world in terms of variables and their values makes it easy 
to describe such interventions; using structural equations makes it easy to define the effect of 
an intervention. 

This can be seen already in the forest-fire example. In the conjunctive model, if we 
see a forest fire, then we can say that if the arsonist hadn’t dropped a match, there would 
have been no forest fire. This follows in the model because setting MD = 0 makes
FF = min(L, MD) = 0. 

Before going into further details, I give an example of the power of structural equations. 
Suppose that we are interested in the relationship between education, an employee’s skill 
level, and his salary. Suppose that education can affect skill level (but does not necessarily 
always do so, since a student may not pay attention or have poor teachers) and skill level 
affects salary. We can give a simplified model of this situation using four variables: 


• E for education level, with values 0 (no education), 1 (one year of education), and 2 (two
  years of education);

• SL for skill level, with values 0 (low), 1 (medium), and 2 (high);

• S for salary, which for simplicity we can also take to have three values: 0 (low), 1
  (medium), and 2 (high);

• U, a variable that determines whether education will have an impact on skill level.


There are two relevant equations. The first says that education determines skill level, provided 
that the external conditions (determined by U) are favorable, and otherwise has no impact: 


SL = \begin{cases} E & \text{if } U = 1 \\ 0 & \text{if } U = 0. \end{cases}


The second says that skill level determines salary: 
S = SL. 


Although this model significantly simplifies some complicated relationships, we can already 
use it to answer interesting questions. Suppose that, in fact, we observe that Fred has salary
level 1. We can thus infer that U = E = SL = 1. We can ask what Fred’s salary level
would have been had he attained education level 2. This amounts to intervening and setting
E = 2. Since U = 1, it follows that SL = 2 and thus S = 2; Fred would have a high salary.
Similarly, if we observe that Fred’s education level is 1 but his salary is low, then we can infer
that U = 0 and SL = 0, and we can say that his salary would still be low even if he had an
extra year of education. 
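
The reasoning about Fred can be sketched in a few lines of Python. The function `solve` and its parameter `set_e` are hypothetical names introduced here for illustration; the two equations are the ones given above, and the value of E is treated as given unless an intervention overrides it.

```python
from typing import Optional


def solve(u: int, e: int, set_e: Optional[int] = None) -> dict:
    """Solve the education model for given U and observed E.

    If set_e is given, it models an intervention that overrides E.
    """
    if set_e is not None:
        e = set_e
    sl = e if u == 1 else 0   # SL = E if U = 1, and SL = 0 if U = 0
    s = sl                    # S = SL
    return {"U": u, "E": e, "SL": sl, "S": s}


# Which (U, E) pairs are consistent with observing S = 1?
consistent = [(u, e) for u in (0, 1) for e in (0, 1, 2) if solve(u, e)["S"] == 1]
print(consistent)

# Counterfactual: Fred's salary had he attained education level 2.
print(solve(u=1, e=1, set_e=2)["S"])

# Second scenario: E = 1 with a low salary forces U = 0; extra education does not help.
print(solve(u=0, e=1, set_e=2)["S"])
```

The enumeration confirms that U = 1 and E = 1 is the only setting compatible with salary level 1, which is the inference made in the text.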

A more sophisticated model would doubtless include extra factors and more sensitive de- 
pendencies. Still, even at this level, I hope it is clear that the structural equations framework 
can give reasonable answers. 

When working with structural equations, it turns out to be conceptually useful to split 
the variables into two sets: the exogenous variables, whose values are determined by factors 
outside the model, and the endogenous variables, whose values are ultimately determined by 
the exogenous variables. In the forest-fire example, the variable U, which determines whether 
the lightning strikes and whether the arsonist drops a match, is exogenous; the variables FF,
L, and MD are endogenous. The value of U determines the values of L and MD, which in turn
determine the value of FF. We have structural equations for the three endogenous variables
that describe how this is done. In general, there is a structural equation for each endogenous 
variable, but there are no equations for the exogenous variables; as I said, their values are 
determined by factors outside the model. That is, the model does not try to “explain” the 
values of the exogenous variables; they are treated as given. 

The split between exogenous and endogenous variables has another advantage. The struc- 
tural equations are all deterministic. As we shall see in Section 2.5, when we want to talk 
about the probability that A is a cause of B, we can do so quite easily by putting a probability 
distribution on the values of the exogenous variables. This gives us a way of talking about the 
probability of there being a fire if lightning strikes, while still using deterministic equations. 
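
As a rough illustration of this idea (the full treatment is deferred to Section 2.5), the sketch below puts a distribution on the four contexts of the conjunctive forest-fire model and computes the probability of fire given lightning while keeping the equations deterministic. The numbers in `PRIOR` are invented for the example, and the function names are hypothetical.

```python
from fractions import Fraction

# Hypothetical distribution over contexts U = (i, j); these numbers are invented.
PRIOR = {
    (0, 0): Fraction(4, 10),
    (0, 1): Fraction(1, 10),
    (1, 0): Fraction(3, 10),
    (1, 1): Fraction(2, 10),
}


def solve_conjunctive(u):
    """Deterministic equations: L = U_1, MD = U_2, FF = min(L, MD)."""
    l, md = u
    return {"L": l, "MD": md, "FF": min(l, md)}


def probability(event, given=lambda world: True):
    """Conditional probability of `event` given `given`, both predicates on solved worlds."""
    worlds = [(solve_conjunctive(u), p) for u, p in PRIOR.items()]
    denominator = sum(p for w, p in worlds if given(w))
    numerator = sum(p for w, p in worlds if given(w) and event(w))
    return numerator / denominator


p_fire_given_lightning = probability(
    lambda w: w["FF"] == 1, given=lambda w: w["L"] == 1
)
print(p_fire_given_lightning)
```

With these assumed numbers, the contexts with lightning carry probability 1/2 in total, and only the context (1, 1) also yields a fire, so the conditional probability is 2/5.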

In any case, with this background, we can formally define a causal model M as a pair 
(S, F), where S is a signature, which explicitly lists the endogenous and exogenous variables 
and characterizes their possible values, and F defines a set of structural equations, relating the 
values of the variables. In the next two paragraphs, I define S and F formally; the definitions 
can be skipped by the less mathematically inclined reader. 

A signature S is a tuple (𝒰, 𝒱, R), where 𝒰 is a set of exogenous variables, 𝒱 is a set of
endogenous variables, and R associates with every variable Y ∈ 𝒰 ∪ 𝒱 a nonempty set R(Y)
of possible values for Y (i.e., the set of values over which Y ranges). As suggested earlier,
in the forest-fire example, we can take 𝒰 = {U}; that is, U is exogenous, R(U) consists of
the four possible values of U discussed earlier, 𝒱 = {FF, L, MD}, and R(FF) = R(L) =
R(MD) = {0, 1}.

The function F associates with each endogenous variable X ∈ 𝒱 a function denoted F_X
such that F_X maps ×_{Z ∈ (𝒰 ∪ 𝒱 − {X})} R(Z) to R(X). (Recall that 𝒰 ∪ 𝒱 − {X} is the set
consisting of all the variables in either 𝒰 or 𝒱 that are not in {X}. The notation
×_{Z ∈ (𝒰 ∪ 𝒱 − {X})} R(Z) denotes the cross-product of R(Z), where Z ranges over the variables
in 𝒰 ∪ 𝒱 − {X}; thus, if 𝒰 ∪ 𝒱 − {X} = {Z_1, ..., Z_k}, then ×_{Z ∈ (𝒰 ∪ 𝒱 − {X})} R(Z) consists
of tuples of the form (z_1, ..., z_k), where z_i is a possible value of Z_i, for i = 1, ..., k.)
This mathematical notation just makes precise the fact that F_X determines the value of X,
given the values of all the other variables in 𝒰 ∪ 𝒱. In the running forest-fire example, F_FF
would depend on whether we are considering the conjunctive model (where both the lightning
strike and dropped match are needed for the fire) or the disjunctive model (where only one of
the two is needed). In the conjunctive model, as I have already noted, F_FF(L, MD, U) = 1 iff
min(L, MD) = 1 (“iff” means “if and only if”; this standard abbreviation is used throughout
the book); in the disjunctive model, F_FF(L, MD, U) = 1 iff max(L, MD) = 1. Note that, in both
cases, the value of F_FF is independent of U; it depends only on the values of L and MD. Put
another way, this means that when we hold the values of L and MD fixed, changing the value of
U has no effect on the value of F_FF. While the values of MD and L do depend on the value of
U, writing things in this way allows us to consider the effects of external interventions that
may override U. For example, we can consider F_FF(0, 1, (1, 1)); this tells us what happens
if the lightning does not strike and the match is dropped, even if the value of U is such that 
L = 1. In what follows, we will be interested only in interventions on endogenous variables. 
As I said, we take the values of the exogenous variables as given, so do not intervene on them. 
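
A minimal sketch of the signature and of the function for FF in the forest-fire example follows. The dictionary layout and the function names are assumptions made for illustration, not notation from the text; the point is that the function takes U as an argument yet ignores it.

```python
# Signature of the forest-fire example: variables and their ranges.
EXOGENOUS = {"U": [(i, j) for i in (0, 1) for j in (0, 1)]}
ENDOGENOUS = {"FF": [0, 1], "L": [0, 1], "MD": [0, 1]}


def f_ff_conjunctive(l, md, u):
    """F_FF in the conjunctive model; u is accepted but has no effect."""
    return min(l, md)


def f_ff_disjunctive(l, md, u):
    """F_FF in the disjunctive model; u is accepted but has no effect."""
    return max(l, md)


# F_FF(0, 1, (1, 1)): no lightning, match dropped, although U = (1, 1).
print(f_ff_conjunctive(0, 1, (1, 1)), f_ff_disjunctive(0, 1, (1, 1)))

# Holding L and MD fixed, changing U never changes F_FF.
u_irrelevant = all(
    len({f(l, md, u) for u in EXOGENOUS["U"]}) == 1
    for f in (f_ff_conjunctive, f_ff_disjunctive)
    for l in ENDOGENOUS["L"]
    for md in ENDOGENOUS["MD"]
)
print(u_irrelevant)
```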


The notation F_X(Y, Y′, U) for the equation describing how the value of X depends on Y,
Y′, and U is not terribly user-friendly. I typically simplify notation and write X = Y + U
instead of F_X(Y, Y′, U) = Y + U. (Note that the variable Y′ does not appear on the right-hand
side of the equation. That means that the value of X does not depend on that of Y′.) With
this simplified notation, the equations for the forest-fire example are L = U_1, MD = U_2, and
either FF = min(L, MD) or FF = max(L, MD), depending on whether we are considering
the conjunctive or disjunctive model.


Although I may write something like X = Y + U, the fact that X is assigned Y + U does
not imply that Y is assigned X − U; that is, F_Y(X, Y′, U) = X − U does not necessarily
hold. The equation X = Y + U implies that if Y = 3 and U = 2, then X = 5, regardless
of how Y′ is set. Going back to the forest-fire example, setting FF = 0 does not mean that
the match is “undropped”! This asymmetry in the treatment of the variables on the left- and 
right-hand sides of the equality sign means that (the value of) Y can depend on (the value of) 
X without X depending on Y. This, in turn, is why causality is typically asymmetric in the 
formal definition that I give shortly: if A is a cause of B, then B will typically not be a cause 
of A. (See the end of Section 2.7 for more discussion of this issue.) 


As I suggested earlier, the key role of the structural equations is that they allow us to 
determine what happens if things had been other than they were, perhaps due to an external 
intervention; for example, the equations tell us what would happen if the arsonist had not 
dropped a match (even if in fact he did). This will be critical in defining causality. 


Since a world in a causal model is described by the values of variables, understanding 
what would happen if things were other than they were amounts to asking what would happen 
if some variables were set to values perhaps different from their actual values. Setting the 
value of some variable X to x in a causal model M = (S, F) results in a new causal model
denoted M_{X←x}. In the new causal model, the equation for X is simple: X is just set to
x; the remaining equations are unchanged. More formally, M_{X←x} = (S, F_{X←x}), where
F_{X←x} is the result of replacing the equation for X in F by X = x and leaving the remaining
equations untouched. Thus, if M^C is the conjunctive model of the forest fire, then in M^C_{MD←0},
the model that results from intervening in M^C by having the arsonist not drop a match, the
equation MD = U_2 is replaced by MD = 0.

As I said in Chapter 1, the HP definition involves counterfactuals. The equations in a
causal model can be given a straightforward counterfactual interpretation. An equation such
as x = F_X(u, y) should be thought of as saying that in a context where the exogenous variable
U has value u, if Y were set to y by some means (not specified in the model), then X would
take on the value x. As I noted earlier, when Y is set to y, this can override the value of Y
according to the equations. For example, MD can be set to 0 even in a context where the
exogenous variable U = (1, 1), so that, according to the equations, MD would be 1.
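
An intervention can be sketched as an operation that swaps out one equation and leaves the rest alone. The representation below (a dictionary of equations, plus the helper names `intervene` and `solve`) is an assumption chosen for illustration, using the conjunctive forest-fire model.

```python
def conjunctive_model():
    """Equations of the conjunctive model, one per endogenous variable.

    Each equation maps (context u, values computed so far) to a value.
    """
    return {
        "L": lambda u, v: u[0],                   # L = U_1
        "MD": lambda u, v: u[1],                  # MD = U_2
        "FF": lambda u, v: min(v["L"], v["MD"]),  # FF = min(L, MD)
    }


def intervene(equations, var, value):
    """Return the equations after setting var to value; other equations are unchanged."""
    new_equations = dict(equations)
    new_equations[var] = lambda u, v: value
    return new_equations


def solve(equations, u, order=("L", "MD", "FF")):
    """Evaluate the equations in an order that respects the dependencies."""
    values = {}
    for var in order:
        values[var] = equations[var](u, values)
    return values


actual = solve(conjunctive_model(), (1, 1))
counterfactual = solve(intervene(conjunctive_model(), "MD", 0), (1, 1))
print(actual)
print(counterfactual)
```

In the context (1, 1) the fire occurs, while in the intervened model with MD set to 0 it does not, even though the context itself is unchanged.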


It may seem somewhat circular to use causal models, which clearly already encode causal 
relationships, to define causality. There is some validity to this concern. After all, how do we 
determine whether a particular equation holds? We might believe that dropping a lit match 
results in a forest burning in part because of our experience with lit matches and dry wood, 
and thus believe that a causal relationship holds between the lit match and the dry wood. 
We might also have a general theory of which this is a particular outcome, but there too, 
roughly speaking, the theory is being understood causally. Nevertheless, I would claim that 
this definition is useful. In many examples, there is general agreement as to the appropriate 
causal model. The structural equations do not express actual causality; rather, they express 
the effects of interventions or, more generally, of variables taking on values other than their 
actual values. 


Of course, there may be uncertainty about the effects of interventions, just as there may be 
uncertainty about the true setting of the values of the exogenous variables in a causal model. 
For example, we may be uncertain about whether smoking causes cancer (this represents un- 
certainty about the causal model), uncertain about whether a particular patient Sam actually 
smoked (this is uncertainty about the value of the exogenous variables that determine whether 
Sam smokes), and uncertain about whether Sam’s brother Jack, who did smoke and got can- 
cer, would have gotten cancer had he not smoked (this is uncertainty about the effect of an 
intervention, which amounts to uncertainty about the values of exogenous variables and pos- 
sibly the equations). All this uncertainty can be described by putting a probability on causal 
models and on the values of the exogenous variables. We can then talk about the probability 
that A is a cause of B. (See Section 2.5 for further discussion of this point.) 


Using (the equations in) a causal model, we can determine whether a variable Y is
(in)dependent of variable X. Y depends on X if there is some setting of all the variables
in 𝒰 ∪ 𝒱 other than X and Y such that varying the value of X in that setting results in a
variation in the value of Y; that is, there is a setting z⃗ of the variables other than X and Y and
values x and x′ of X such that F_Y(x, z⃗) ≠ F_Y(x′, z⃗). (The z⃗ represents the values of all the
other variables. This vector notation is a useful shorthand that is used throughout the book; I
say more about it for readers unfamiliar with it at the end of the section.) Note that by
“dependency” here I mean something closer to “immediate dependency” or “direct dependency”.
This notion of dependency is not transitive; that is, if X_1 depends on X_2 and X_2 depends on
X_3, then it is not necessarily the case that X_1 depends on X_3. If Y does not depend on X,
then Y is independent of X.
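
The definition of dependence suggests a brute-force check: fix every other variable, vary X, and see whether the output changes. The helper `depends_on` below is a hypothetical sketch of that check, applied to the function for FF in the conjunctive forest-fire model.

```python
from itertools import product


def depends_on(f, ranges, x):
    """Return True if f's output varies with argument x in some setting of the others.

    f takes keyword arguments named after the variables in `ranges`.
    """
    others = [name for name in ranges if name != x]
    for setting in product(*(ranges[name] for name in others)):
        fixed = dict(zip(others, setting))
        outputs = {f(**fixed, **{x: value}) for value in ranges[x]}
        if len(outputs) > 1:
            return True
    return False


RANGES = {"L": [0, 1], "MD": [0, 1], "U": [(i, j) for i in (0, 1) for j in (0, 1)]}


def f_ff(L, MD, U):
    """F_FF in the conjunctive model: FF = min(L, MD), regardless of U."""
    return min(L, MD)


print(depends_on(f_ff, RANGES, "L"), depends_on(f_ff, RANGES, "MD"), depends_on(f_ff, RANGES, "U"))
```

The check also illustrates the failure of transitivity: FF depends on L, and L depends on U through its own equation, yet FF does not (directly) depend on U.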


Whether Y depends on X may depend in part on the variables in the language. In the 
original description of the voting scenario, we had just twelve variables, and W depended on
each of V_1, ..., V_11. However, suppose that the votes are tabulated by a machine. We can add
a variable T that describes the final tabulation of the machine, where T can have any value
of the form (t_1, t_2), where t_1 represents the number of votes for Billy and t_2 represents the
number of votes for Suzy (so that t_1 and t_2 are non-negative integers whose sum is 11). Now
W depends only on T, while T depends on V_1, ..., V_11.


In general, every variable can depend on every other variable. But in most interesting
situations, each variable depends on relatively few other variables. The dependencies between
variables in a causal model M can be described using a causal network (sometimes called
a causal graph), consisting of nodes and directed edges. I often omit the exogenous variables
from the graph. Thus, both Figure 2.1(a) (where the exogenous variable is included)
and Figure 2.1(b) (where it is omitted) would be used to describe the forest-fire example.
Figure 2.1(c) is yet another description of the forest-fire example, this time replacing the
single exogenous variable U by an exogenous variable U_1 that determines L and an exogenous
variable U_2 that determines MD. Again, Figure 2.1(b) is the causal graph that results when
the exogenous variables in Figure 2.1(c) are omitted. Since all the “action” happens with the
endogenous variables, Figure 2.1(b) really gives us all the information we need to analyze this
example.


[Figure 2.1: A graphical representation of structural equations. Panel (a): U → L, U → MD,
L → FF, MD → FF. Panel (b): L → FF, MD → FF. Panel (c): U_1 → L, U_2 → MD, L → FF, MD → FF.]


The fact that there is a directed edge from U to both L and MD in Figure 2.1(a) (with the
direction marked by the arrow) says that the value of the exogenous variable U affects the
values of L and MD, and that nothing else affects them. The directed edges from L and MD to FF
say that only the values of MD and L directly affect the value of FF. (The value of U also
affects the value of FF, but it does so indirectly, through its effect on L and MD.)

Causal networks convey only the qualitative pattern of dependence; they do not tell us 
how a variable depends on others. For example, the same causal network would be used for 
both the conjunctive and disjunctive models of the forest-fire example. Nevertheless, causal 
networks are useful representations of causal models. 
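
To see that the network records only which variables depend on which, one can derive the edges from the equations and compare the two models. The sketch below uses a hypothetical representation (each equation reads a dictionary of the other variables' values) and a brute-force dependence test; none of these names come from the text.

```python
from itertools import product

RANGES = {
    "U": [(i, j) for i in (0, 1) for j in (0, 1)],
    "L": [0, 1],
    "MD": [0, 1],
    "FF": [0, 1],
}


def equations(combine):
    """Forest-fire equations; `combine` is min (conjunctive) or max (disjunctive)."""
    return {
        "L": lambda v: v["U"][0],
        "MD": lambda v: v["U"][1],
        "FF": lambda v: combine(v["L"], v["MD"]),
    }


def edges(eqs):
    """Directed edges (X, Y) such that Y directly depends on X."""
    result = set()
    for y, f in eqs.items():
        for x in RANGES:
            if x == y:
                continue
            others = [z for z in RANGES if z not in (x, y)]
            for setting in product(*(RANGES[z] for z in others)):
                fixed = dict(zip(others, setting))
                if len({f({**fixed, x: val}) for val in RANGES[x]}) > 1:
                    result.add((x, y))
                    break
    return result


conjunctive_edges = edges(equations(min))
disjunctive_edges = edges(equations(max))
print(sorted(conjunctive_edges))
print(conjunctive_edges == disjunctive_edges)
```

Both models yield the same four edges, those of Figure 2.1(a), even though their equations for FF differ.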

I will be particularly interested in causal models where there are no circular dependencies. 
For example, it is not the case that X depends on Y and Y depends on X or, more generally,
that X_1 depends on X_2, which depends on X_3, which depends on X_4, which depends on X_1.
Informally, a model is said to be recursive (or acyclic) if there are no such dependency cycles
(sometimes called feedback cycles) among the variables.

The intuition that the values of exogenous variables are determined by factors outside the
model is formalized by requiring that an exogenous variable does not depend on any other
variables. (Actually, nothing in the following discussion changes if we allow an exogenous
variable to depend on other exogenous variables. All that matters is that an exogenous variable
does not depend on any endogenous variables.) In a recursive model, we can say more
about how the values of endogenous variables are determined. Specifically, the values of some
of the endogenous variables depend only on the values of the exogenous variables. This is the
case for the variables L and MD in the forest-fire example depicted in Figure 2.1. Think of
these as the “first-level” endogenous variables. The values of the “second-level” endogenous
variables depend only on the values of exogenous variables and the values of first-level
endogenous variables. The variable FF is a second-level variable in this sense; its value is
determined by that of L and MD, and these are first-level variables. We can then define
third-level variables,
fourth-level variables, and so on. 


Actually, the intuition I have just given is for what I call a strongly recursive model. It is 
easy to see that in a strongly recursive model, the values of all variables are determined given 
a context, that is, a setting u⃗ for the exogenous variables in 𝒰. Given u⃗, we can determine
the values of the first-level variables using the equations; we can then determine the values 
of the second-level variables (whose values depend only on the context and the values of the 
first-level variables), then the third-level variables, and so on. The notion of a recursive model 
generalizes the definition of a strongly recursive model (every strongly recursive model is 
recursive, but the converse is not necessarily true) while still keeping this key feature of having 
the context determine the values of all the endogenous variables. In a recursive model, the 
partition of endogenous variables into first-level variables, second-level variables, and so on 
can depend on the context; in different contexts, the partition might be different. Moreover, 
in a recursive model, our definition guarantees that causality is asymmetric: it cannot be the 
case that A is a cause of B and B is a cause of A if A and B are distinct. (See the discussion
at the end of Section 2.7.) 


The next four paragraphs provide a formal definition of recursive models. They can be 
skipped on a first reading. 


Say that a model M is strongly recursive (or acyclic) if there is some partial order ⪯ on
the endogenous variables in M (the ones in 𝒱) such that unless X ⪯ Y, Y is not affected
by X. Roughly speaking, X ⪯ Y denotes that X affects Y (think of “affects” here as the
transitive closure of the direct dependency relation discussed earlier; X affects Y if there is
some chain X_1, ..., X_k such that X = X_1, Y = X_k, and X_{i+1} directly depends on X_i for
i = 1, ..., k − 1). The fact that ⪯ is a partial order means that ⪯ is a reflexive, anti-symmetric,
and transitive relation. Reflexivity means that X ⪯ X for each variable X (X affects X);
anti-symmetry means that if X ⪯ Y and Y ⪯ X, then X = Y (if X affects Y and Y affects
X, then we must have X = Y); finally, transitivity means that if X ⪯ Y and Y ⪯ Z, then
X ⪯ Z (if X affects Y and Y affects Z, then X affects Z). Since ⪯ is partial, it may be the
case that for some variables X and Y, neither X ⪯ Y nor Y ⪯ X holds; that is, X does not
affect Y and Y does not affect X.


While reflexivity and transitivity seem to be natural properties of the “affects” relation,
anti-symmetry is a nontrivial assumption. The fact that ⪯ is anti-symmetric and transitive
means that there is no cycle of dependence between a collection X_1, ..., X_n of variables. It
cannot be the case that X_1 affects X_2, X_2 affects X_3, ..., X_{n−1} affects X_n, and X_n affects
X_1. For then we would have X_1 ⪯ X_2, X_2 ⪯ X_3, ..., X_{n−1} ⪯ X_n, and X_n ⪯ X_1. By
transitivity, we would have X_2 ⪯ X_1, which together with X_1 ⪯ X_2 violates anti-symmetry
(given that X_1 and X_2 are distinct). A causal network corresponding
to a causal model where there is no such cyclic dependence between the variables is acyclic.
That is, there is no sequence of directed edges that both starts and ends at the same node. 
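
Acyclicity of a causal network can be tested mechanically. The sketch below assumes Python 3.9 or later, whose standard library module `graphlib` raises `CycleError` when no topological order exists; the helper name `is_acyclic` and the dictionary encoding (each variable mapped to the set of variables it directly depends on) are illustrative choices.

```python
from graphlib import CycleError, TopologicalSorter


def is_acyclic(parents):
    """Return True if the graph given by `parents` has no directed cycle.

    `parents` maps each variable to the set of variables it directly depends on.
    """
    try:
        # Consuming the iterator forces the cycle check.
        tuple(TopologicalSorter(parents).static_order())
        return True
    except CycleError:
        return False


forest_fire = {"L": set(), "MD": set(), "FF": {"L", "MD"}}
four_cycle = {"X1": {"X4"}, "X2": {"X1"}, "X3": {"X2"}, "X4": {"X3"}}

print(is_acyclic(forest_fire), is_acyclic(four_cycle))
```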


A model M is recursive if, for each context (setting of the exogenous variables) u⃗, there is
a partial order ⪯_{u⃗} of the endogenous variables such that unless X ⪯_{u⃗} Y, Y is independent of
X in (M, u⃗), where Y is independent of X in (M, u⃗) if, for all settings z⃗ of the endogenous
variables other than X and Y, and all values x and x′ of X, F_Y(x, z⃗, u⃗) = F_Y(x′, z⃗, u⃗). If M
is a strongly recursive model, then we can assume that all the partial orders ⪯_{u⃗} are the same;
in a recursive model, they may differ. Example 2.3.3 below shows why it is useful to consider 
the more general notion of recursive model. 


As I said, if M is a recursive causal model, then given a context u⃗, there is a unique
solution for all the equations. We simply solve for the variables in the order given by ≺_{u⃗}
(where X ≺_{u⃗} Y if X ⪯_{u⃗} Y and X ≠ Y). The values of the variables that come first in the
order, that is, the variables X such that there is no variable Y such that Y ≺_{u⃗} X, depend
only on the exogenous variables, so their values are immediately determined by the values of
the exogenous variables. The values of variables later in the order can be determined from the 
equations once we have determined the values of all the variables earlier in the order. 
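
The solution procedure can be sketched for the simpler, strongly recursive case, where a single order works in every context. The code assumes Python 3.9 or later for `graphlib`, and the names `PARENTS`, `EQUATIONS`, and `solve` are hypothetical; it uses the disjunctive forest-fire model as the example.

```python
from graphlib import TopologicalSorter

# Direct dependencies among endogenous variables; the exogenous U is left out.
PARENTS = {"L": set(), "MD": set(), "FF": {"L", "MD"}}

EQUATIONS = {
    "L": lambda u, v: u[0],                   # L = U_1
    "MD": lambda u, v: u[1],                  # MD = U_2
    "FF": lambda u, v: max(v["L"], v["MD"]),  # FF = max(L, MD), disjunctive model
}


def solve(u):
    """Compute the unique solution in context u, visiting variables in dependency order."""
    values = {}
    for var in TopologicalSorter(PARENTS).static_order():
        values[var] = EQUATIONS[var](u, values)
    return values


print(solve((1, 0)))
print(solve((0, 0)))
```

A fully general recursive model would need the order to be chosen per context; this sketch does not attempt that.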


With the definition of recursive model under our belt, we can get back to the issue of 
choosing the “right” model. There are many nontrivial decisions to be made when choosing 
the causal model to describe a given situation. One significant decision is the set of variables 
used. As we shall see, the events that can be causes and those that can be caused are expressed 
in terms of these variables, as are all the intermediate events. The choice of variables essen- 
tially determines the “language” of the discussion; new events cannot be created on the fly, so 
to speak. In our running forest-fire example, the fact that there is no variable for unattended 
campfires means that the model does not allow us to consider unattended campfires as a cause 
of the forest fire. 


Once the set of variables is chosen, the next step is to decide which are exogenous and 
which are endogenous. As I said earlier, the exogenous variables to some extent encode the 
background situation that we want to take as given. Other implicit background assumptions 
are encoded in the structural equations themselves. Suppose that we are trying to decide 
whether a lightning bolt or a match was the cause of the forest fire, and we want to take as 
given that there is sufficient oxygen in the air and the wood is dry. We could model the dryness 
of the wood by an exogenous variable D with values 0 (the wood is wet) and 1 (the wood is 
dry). (Of course, in practice, we may want to allow D to have more values, indicating the 
degree of dryness of the wood, but that level of complexity is unnecessary for the points I 
am trying to make here.) By making D exogenous, its value is assumed to be determined by 
external factors that are not being modeled. We could also take the amount of oxygen to be 
described by an exogenous variable (e.g., there could be a variable O with two values: 0, for
insufficient oxygen, and 1, for sufficient oxygen); alternatively, we could choose not to model
oxygen explicitly at all. For example, suppose that we have, as before, a variable MD (match
dropped by arsonist) and another variable WB (wood burning), with values 0 (it’s not) and 1
(it is). The structural equation F_WB would describe the dependence of WB on D and MD.


By setting F_WB(1, 1) = 1, we are saying that the wood will burn if the lit match is dropped
and the wood is dry. Thus, the equation is implicitly modeling our assumption that there is 
sufficient oxygen for the wood to burn. Whether the modeler should include O in the model 
depends on whether the modeler wants to consider contexts where the value of O is 0. If in 
all contexts relevant to the modeler there is sufficient oxygen, there is no point in cluttering 
up the model by adding the variable O (although adding it will not affect anything). 


According to the definition of causality in Section 2.2, only endogenous variables can be 
causes or be caused. Thus, if no variables encode the presence of oxygen, or if it is encoded 
only in an exogenous variable, then oxygen cannot be a cause of the forest burning in that 
model. If we were to explicitly model the amount of oxygen in the air (which certainly might 
be relevant if we were analyzing fires on Mount Everest), then F_WB would also take values
of O as an argument, and the presence of sufficient oxygen might well be a cause of the wood
burning, and hence the forest burning. Interestingly, in the law, there is a distinction between
what are called conditions and causes. Under typical circumstances, the presence of oxygen
would be considered a condition, and would thus not count as a cause of the forest burning,
whereas the lightning would. Although
the distinction is considered important, it does not seem to have been carefully formalized.
One way of understanding it is in terms of exogenous versus endogenous variables: condi- 
tions are exogenous, (potential) causes are endogenous. I discuss an alternative approach to 
understanding the distinction, in terms of theories of normality, in Section 3.2. 


It is not always straightforward to decide what the “right” causal model is in a given sit- 
uation. What is the “right” set of variables to use? Which should be endogenous and which 
should be exogenous? As we shall see, different choices of endogenous variables can lead to 
different conclusions, although they seem to be describing exactly the same situation. And 
even after we have chosen the variables, how do we determine the equations that relate them? 
Given two causal models, how do we decide which is better? 


These are all important questions. For now, however, I will ignore them. The definition of 
causality I give in the next section is relative to a model M and context u⃗. A may be a cause
of B relative to (M, u⃗) and not a cause of B relative to (M′, u⃗′). Thus, the choice of model
can have a significant impact in determining causality ascriptions. The definition is agnostic 
regarding the choice of model; it has nothing to say about whether the choice of variables or 
the structural equations used are in some sense “reasonable”. Of course, people may legiti- 
mately disagree about how well a particular causal model describes the world. Although the 
formalism presented here does not provide techniques to settle disputes about which causal 
model is the right one, at least it provides tools for carefully describing the differences be- 
tween causal models, so that it should lead to more informed and principled decisions about 
those choices. That said, I do return to some of the issues raised earlier when discussing the 
examples in this section, and I discuss them in more detail in Chapter 4. 

As promised, I conclude this section with a little more discussion of the vector notation, for 
readers unfamiliar with it. I use this notation throughout the book to denote sets of variables 
or their values. For example, if there are three exogenous variables, say U_1, U_2, and U_3, and
U_1 = 0, U_2 = 0, and U_3 = 1, then I write U⃗ = (0, 0, 1) as an abbreviation of U_1 = 0,
U_2 = 0, and U_3 = 1. This vector notation is also used to describe the values x⃗ of a collection
X⃗ of endogenous variables. I deliberately abuse notation at times and view U⃗ both as a set
of variables and as a sequence of variables. For example, when I write U⃗ = (0, 0, 1), then I
am thinking of U⃗ as the sequence (U_1, U_2, U_3), and this equality means that U_1 = 0, U_2 = 0,
and U_3 = 1. However, in expressions such as U_2 ∈ U⃗ and U⃗ ⊆ U⃗′, U⃗ should be understood
as the corresponding unordered set {U_1, U_2, U_3}. I hope that the intent is always clear from
context. For the most part, readers who do not want to delve into details can just ignore the 
vector arrows. 


2.2. A Formal Definition of Actual Cause 


In this section, I give the HP definition of causality and show how it works in a number of 
examples. But before giving the definition, I have to define a formal language for describing 
causes. 


2.2.1 A language for describing causality 


To make the definition of actual causality precise, it is helpful to have a formal language for 
making statements about causality. The language is an extension of propositional logic, where 
the primitive events (i.e., primitive propositions) have the form $X = x$ for an endogenous
variable $X$ and some possible value $x$ of $X$. The primitive event $MD = 0$ says “the lit match is
not dropped”; similarly, the primitive event $L = 1$ says “lightning occurred”. These primitive
events can be combined using standard propositional connectives, such as $\land$, $\lor$, and $\neg$. Thus,
the formula $MD = 0 \lor L = 1$ says “either the lit match is not dropped or lightning occurred”,
$MD = 0 \land L = 1$ says “the lit match is not dropped and lightning occurred”, and $\neg(L = 1)$
says “lightning did not occur” (which is equivalent to $L = 0$, given that the only possible
values of $L$ are 0 or 1). A Boolean combination of primitive events is a formula obtained by
combining primitive events using $\land$, $\lor$, and $\neg$. For example, $\neg(MD = 0 \lor L = 1) \land WB = 1$
is a Boolean combination of the primitive events $MD = 0$, $L = 1$, and $WB = 1$. The key
novel feature in the language is that we can talk about interventions. A formula of the form
$[\vec{Y} \gets \vec{y}](X = x)$ says that after intervening to set the variables in $\vec{Y}$ to $\vec{y}$, $X$ takes on the
value $x$.

(A short aside: I am thinking of an event here as a subset of a state space; this is the standard 
usage in computer science and probability. Once I give semantics to causal formulas, it will 
be clear that we can associate a causal formula $\varphi$ with a set of contexts (the set of contexts
where $\varphi$ is true), so it can indeed be identified with an event in this sense. In the philosophy
literature, the use of the word “event” is somewhat different, and what counts as an event is 
contentious; see the notes at the end of this chapter and at the end of Chapter 4.) 

Now for the formal details. Given a signature $\mathcal{S} = (\mathcal{U}, \mathcal{V}, \mathcal{R})$, a primitive event is a formula
of the form $X = x$, for $X \in \mathcal{V}$ and $x \in \mathcal{R}(X)$. A causal formula (over $\mathcal{S}$) is one of the form
$[Y_1 \gets y_1, \ldots, Y_k \gets y_k]\varphi$, where

• $\varphi$ is a Boolean combination of primitive events,

• $Y_1, \ldots, Y_k$ are distinct variables in $\mathcal{V}$, and

• $y_i \in \mathcal{R}(Y_i)$.

Such a formula is abbreviated as $[\vec{Y} \gets \vec{y}]\varphi$, using the vector notation. The special case where
$k = 0$ is abbreviated as $[\,]\varphi$ or, more often, just $\varphi$. Intuitively, $[Y_1 \gets y_1, \ldots, Y_k \gets y_k]\varphi$ says
that $\varphi$ would hold if $Y_i$ were set to $y_i$, for $i = 1, \ldots, k$. It is worth emphasizing that the
commas in $[Y_1 \gets y_1, \ldots, Y_k \gets y_k]\varphi$ are really acting like conjunctions; this says that if $Y_1$
were set to $y_1$ and ... and $Y_k$ were set to $y_k$, then $\varphi$ would hold. I follow convention and use
the comma rather than $\land$. I write $Y_i \gets y_i$ rather than $Y_i = y_i$ to emphasize that here $Y_i$ is
assigned the value $y_i$, as the result of an intervention. For $\mathcal{S} = (\mathcal{U}, \mathcal{V}, \mathcal{R})$, let $\mathcal{L}(\mathcal{S})$ consist
of all Boolean combinations of causal formulas, where the variables in the formulas are taken
from $\mathcal{V}$ and the sets of possible values of these variables are determined by $\mathcal{R}$.

The language above lets us talk about interventions. What we need next is a way of defining 
the semantics of causal formulas: that is, a way of determining when a causal formula is 
true. A causal formula $\psi$ is true or false in a causal model, given a context. I call a pair
$(M, \vec{u})$ consisting of a causal model $M$ and context $\vec{u}$ a causal setting. I write $(M, \vec{u}) \models \psi$
if the causal formula $\psi$ is true in the causal setting $(M, \vec{u})$. For now, I restrict attention to
recursive models, where, given a context, there are no cyclic dependencies. In a recursive
model, $(M, \vec{u}) \models X = x$ if the value of $X$ is $x$ once we set the exogenous variables to $\vec{u}$.
Here I am using the fact that, in a recursive model, the values of all the endogenous variables
are determined by the context. If $\psi$ is an arbitrary Boolean combination of primitive events,
then whether $(M, \vec{u}) \models \psi$ can be determined by the usual rules of propositional logic. For
example, $(M, \vec{u}) \models X = x \lor Y = y$ if either $(M, \vec{u}) \models X = x$ or $(M, \vec{u}) \models Y = y$.

Using causal models makes it easy to give the semantics of formulas of the form
$[\vec{Y} \gets \vec{y}](X = x)$ and, more generally, $[\vec{Y} \gets \vec{y}]\psi$. Recall that the latter formula says that,
after intervening to set the variables in $\vec{Y}$ to $\vec{y}$, $\psi$ holds. Given a model $M$, the model that
describes the result of this intervention is $M_{\vec{Y} \gets \vec{y}}$. Thus,


$(M, \vec{u}) \models [\vec{Y} \gets \vec{y}]\psi$ iff $(M_{\vec{Y} \gets \vec{y}}, \vec{u}) \models \psi$.


The mathematics just formalizes the intuition that the formula $[\vec{Y} \gets \vec{y}]\psi$ is true in the causal
setting $(M, \vec{u})$ exactly if the formula $\psi$ is true in the model that results from the intervention,
in the same context $\vec{u}$.

For example, if $M^d$ is the disjunctive model for the forest fire described earlier, then
$(M^d, (1,1)) \models [MD \gets 0](FF = 1)$: even if the arsonist is somehow prevented from
dropping the match, the forest burns (thanks to the lightning); that is, $(M^d_{MD \gets 0}, (1,1)) \models
FF = 1$. Similarly, $(M^d, (1,1)) \models [L \gets 0](FF = 1)$. However, $(M^d, (1,1)) \models
[L \gets 0, MD \gets 0](FF = 0)$: if the arsonist does not drop the lit match and the lightning
does not strike, then the forest does not burn.
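The three intervention claims in this example can be checked mechanically. The following is a minimal sketch, assuming a hypothetical dictionary encoding of the disjunctive model (the names `M_d`, `solve`, and `holds_after` are illustrative, not notation from the text); an intervention is passed as a dict `do` whose variables ignore their structural equations.

```python
# Disjunctive forest-fire model; equations are listed in dependency order.
M_d = {
    "L":  lambda u, v: u[0],
    "MD": lambda u, v: u[1],
    "FF": lambda u, v: max(v["L"], v["MD"]),
}

def solve(model, context, do=None):
    """Solve the intervened model: variables in `do` are set, not computed."""
    do = do or {}
    values = {}
    for var, equation in model.items():
        values[var] = do[var] if var in do else equation(context, values)
    return values

def holds_after(model, context, do, formula):
    """(M, u) |= [Y <- y]formula, checked as (M_{Y <- y}, u) |= formula."""
    return formula(solve(model, context, do))

u = (1, 1)
print(holds_after(M_d, u, {"MD": 0}, lambda v: v["FF"] == 1))           # True
print(holds_after(M_d, u, {"L": 0}, lambda v: v["FF"] == 1))            # True
print(holds_after(M_d, u, {"L": 0, "MD": 0}, lambda v: v["FF"] == 0))   # True
```

Note that the context stays $(1,1)$ throughout; the intervention overrides the equations for $L$ and $MD$ rather than changing the exogenous variables.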

The notation $(M, \vec{u}) \models \varphi$ is standard in the logic and philosophy communities. It is,
unfortunately, not at all standard in other communities, such as statistics and econometrics.
Although notation is not completely standard in these communities, the general approach
has been to suppress the model $M$ and make the context $\vec{u}$ an argument of the endogenous
variables. Thus, for example, instead of writing $(M, \vec{u}) \models X = x$, in these
communities they would write $X(\vec{u}) = x$. (Sometimes the exogenous variables are also
suppressed, or taken as given, so just $X = x$ is written.) More interestingly, instead of
$(M, \vec{u}) \models [X \gets x](Y = y)$, in these communities they would write $Y_x(\vec{u}) = y$. Although
the latter notation, which I henceforth call statisticians’ notation (although it is more
widespread), is clearly more compact, the compactness occasionally comes at the price of
clarity. For example, in the disjunctive forest-fire example, what does $FF_0(1,1) = 1$ mean?
Is it $(M^d, (1,1)) \models [MD \gets 0](FF = 1)$ or $(M^d, (1,1)) \models [L \gets 0](FF = 1)$? This problem
can be solved by adding the variable being intervened on to the subscript if it is necessary
to remove ambiguity, writing, for example, $FF_{L \gets 0}(1,1) = 0$. Things get more complicated
when we want to write $(M, \vec{u}) \models [X \gets x]\varphi$, where $\varphi$ is a Boolean combination of primitive
events, or when there are several causal models in the picture and it is important to keep track
of them. That said, there are many situations when writing something like $Y_x(\vec{u}) = y$ is
completely unambiguous, and it is certainly more compact. To make things easier for those used
to this notation, I translate various statements into statisticians’ notation, particularly those
where it leads to more compact formulations. This will hopefully have the added advantage
of making both notations familiar to all communities.


2.2.2 The HP definition of actual causality 


I now give the HP definition of actual causality. 

The types of events that the HP definition allows as actual causes are ones of the form
$X_1 = x_1 \land \ldots \land X_k = x_k$, that is, conjunctions of primitive events; this is often abbreviated
as $\vec{X} = \vec{x}$. The events that can be caused are arbitrary Boolean combinations of primitive
events. The definition does not allow statements of the form “A or A’ is a cause of B,” 
although this could be treated as being equivalent to “either A is a cause of B or A’ is a cause 
of B”. However, statements such as “A is a cause of either B or B’” are allowed; this is not 
equivalent to “either A is a cause of B or A is a cause of B’”. 

Note that this means that the relata of causality (what can be a cause and what can be 
caused) depend on the language. It is also worth remarking that there is a great deal of debate 
in the philosophical literature about the relata of causality. Although it is standard in that 
literature to take the relata of causality to be events, exactly what counts as an event is also a 
matter of great debate. I do not delve further into these issues here; see the notes at the end of 
the chapter for references and a little more discussion. 

Although I have been talking up to now about “the” HP definition of causality, I will ac- 
tually consider three definitions, all with the same basic structure. Judea Pearl and I started 
with a relatively simple definition that gradually became more complicated as we discovered 
examples that our various attempts could not handle. One of the examples was discovered 
after our definition was published in a conference paper. So we updated the definition for the 
journal version of the paper. Later work showed that the problem that caused us to change the 
definition was not as serious as originally thought. However, recently I have considered a def- 
inition that is much simpler than either of the two previous versions, which has the additional 
merit of dealing better with a number of problems. It is not clear that the third definition is 
the last word. Moreover, showing how the earlier definitions deal with the various problems 
that have been posed lends further insight into the difficulties and subtleties involved with 
finding a definition of causality. Thus, I consider all three definitions in this book. Following 
Einstein, my goal is to make things as simple as possible, but no simpler!

Like the definition of truth on which it depends, the HP definition of causality is relative to
a causal setting. All three variants of the definition consist of three clauses. The first and third 
are simple and straightforward, and the same for all the definitions. All the work is done by 
the second clause; this is where the definitions differ. 


Definition 2.2.1. $\vec{X} = \vec{x}$ is an actual cause of $\varphi$ in the causal setting $(M, \vec{u})$ if the following
three conditions hold:


AC1. $(M, \vec{u}) \models (\vec{X} = \vec{x})$ and $(M, \vec{u}) \models \varphi$.


AC2. See below.


AC3. $\vec{X}$ is minimal; there is no strict subset $\vec{X}'$ of $\vec{X}$ such that $\vec{X}' = \vec{x}''$ satisfies conditions
AC1 and AC2, where $\vec{x}''$ is the restriction of $\vec{x}$ to the variables in $\vec{X}'$. ■


AC1 just says that $\vec{X} = \vec{x}$ cannot be considered a cause of $\varphi$ unless both $\vec{X} = \vec{x}$ and $\varphi$
actually happen. (I am implicitly taking $(M, \vec{u})$ to characterize the “actual world” here.) AC3
is a minimality condition, which ensures that only those elements of the conjunction $\vec{X} = \vec{x}$
that are essential are considered part of a cause; inessential elements are pruned. Without
AC3, if dropping a lit match qualified as a cause of the forest fire, then dropping a match and
sneezing would also pass the tests of AC1 and AC2. AC3 serves here to strip “sneezing” and
other irrelevant, over-specific details from the cause. AC3 can be viewed as capturing part of
the spirit of the INUS condition; roughly speaking, it says that all the primitive events in the
cause are necessary for the effect.

It is now time to bite the bullet and look at AC2. In the first two versions of the HP 
definition, AC2 consists of two parts. I start by considering the first of them, denoted AC2(a). 
AC2(a) is a necessity condition. It says that for $X = x$ to be a cause of $\varphi$, there must be a value
$x'$ in the range of $X$ such that if $X$ is set to $x'$, $\varphi$ no longer holds. This is the but-for clause;
but for the fact that $X = x$ occurred, $\varphi$ would not have occurred. As we saw in the Billy-Suzy
rock-throwing example in Chapter 1, the naive but-for clause will not suffice. We must be 
allowed to apply it under certain contingencies, that is, under certain counterfactual settings, 
where some variables are set to values other than those they take in the actual situation or held 
to certain values while other variables change. For example, in the case of Suzy and Billy, we 
consider a contingency where Billy does not throw. 

Here is the formal version of the necessity condition: 


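AC2(a). There is a partition of $\mathcal{V}$ (the set of endogenous variables) into two disjoint subsets
$\vec{Z}$ and $\vec{W}$ (so that $\vec{Z} \cap \vec{W} = \emptyset$) with $\vec{X} \subseteq \vec{Z}$ and a setting $\vec{x}'$ and $\vec{w}$ of the variables in
$\vec{X}$ and $\vec{W}$, respectively, such that


$(M, \vec{u}) \models [\vec{X} \gets \vec{x}', \vec{W} \gets \vec{w}]\neg\varphi$.


Using statisticians’ notation, if $\varphi$ is $Y = y$, then the displayed formula becomes $Y_{\vec{x}'\vec{w}}(\vec{u}) \ne y$.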

Roughly speaking, AC2(a) says that the but-for condition holds under the contingency
$\vec{W} = \vec{w}$. We can think of the variables in $\vec{Z}$ as making up the “causal path” from $\vec{X}$ to $\varphi$.
Intuitively, changing the value of some variable in $\vec{X}$ results in changing the value(s) of some
variable(s) in $\vec{Z}$, which results in the values of some other variable(s) in $\vec{Z}$ being changed,
which finally results in the truth value of $\varphi$ changing. (Whether we can think of the variables
in $\vec{Z}$ as making up a causal path from some variable in $\vec{X}$ to some variable in $\varphi$ turns out
to depend in part on which variant of the HP definition we consider. See Section 2.9 for a
formalization and more discussion.) The remaining endogenous variables, the ones in $\vec{W}$, are
off to the side, so to speak, but may still have an indirect effect on what happens.
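For small models, AC2(a) can be checked by brute force. The sketch below is an illustration under stated assumptions, not an algorithm from the text: the names `ac2a_witnesses`, `RANGES`, `solve`, and `M_d` are hypothetical, the model is the disjunctive forest-fire model described earlier, and $\vec{Z}$ is taken implicitly to be the complement of $\vec{W}$. The function enumerates every choice of $\vec{W}$, $\vec{w}$, and $\vec{x}'$ and reports those under which $\varphi$ fails.

```python
from itertools import combinations, product

M_d = {
    "L":  lambda u, v: u[0],
    "MD": lambda u, v: u[1],
    "FF": lambda u, v: max(v["L"], v["MD"]),
}
RANGES = {"L": (0, 1), "MD": (0, 1), "FF": (0, 1)}   # possible values R(X)

def solve(model, context, do=None):
    """Solve the model after the intervention `do` (a dict of forced values)."""
    do = do or {}
    values = {}
    for var, equation in model.items():
        values[var] = do[var] if var in do else equation(context, values)
    return values

def ac2a_witnesses(model, ranges, context, X, phi):
    """Yield (W, w, x') such that phi is false after [X <- x', W <- w]."""
    others = [var for var in model if var not in X]
    for size in range(len(others) + 1):
        for W in combinations(others, size):
            for x_new in product(*(ranges[var] for var in X)):
                for w in product(*(ranges[var] for var in W)):
                    do = {**dict(zip(X, x_new)), **dict(zip(W, w))}
                    if not phi(solve(model, context, do)):
                        yield W, w, x_new

def fire(values):
    return values["FF"] == 1

witnesses = list(ac2a_witnesses(M_d, RANGES, (1, 1), ("L",), fire))
print((("MD",), (0,), (0,)) in witnesses)            # True: W = {MD}, w = 0, x' = 0
print(any(W == () for W, w, x_new in witnesses))     # False: plain but-for fails
```

The search also returns degenerate witnesses in which $FF$ itself is placed in $\vec{W}$ and set to 0. AC2(a) alone does not exclude them, which is one way to see why a further condition is needed.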

Unfortunately, AC1, AC2(a), and AC3 do not suffice for a good definition of causality. In
the rock-throwing example, with just AC1, AC2(a), and AC3, Billy would be a cause of the
bottle shattering. We need a sufficiency condition to block Billy. The sufficiency condition
used in the original definition, roughly speaking, requires that if the variables in $\vec{X}$ and an
arbitrary subset $\vec{Z}'$ of other variables on the causal path (i.e., besides those in $\vec{X}$) are held at
their values in the actual context (where the value of a variable $Y$ in the actual context is the
value $y^*$ such that $(M, \vec{u}) \models Y = y^*$; the values of the variables in $\vec{X}$ in the actual context are
$\vec{x}$, by AC1), then $\varphi$ holds even if $\vec{W}$ is set to $\vec{w}$ (the setting for $\vec{W}$ used in AC2(a)). This is
formalized by the following condition:


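AC2(b$^o$). If $\vec{z}^*$ is such that $(M, \vec{u}) \models \vec{Z} = \vec{z}^*$, then for all subsets $\vec{Z}'$ of $\vec{Z} - \vec{X}$, we have


$(M, \vec{u}) \models [\vec{X} \gets \vec{x}, \vec{W} \gets \vec{w}, \vec{Z}' \gets \vec{z}^*]\varphi$.


Again, taking $\varphi$ to be $Y = y$, in statisticians’ notation this becomes “if $\vec{Z}(\vec{u}) = \vec{z}^*$, then for
all subsets $\vec{Z}'$ of $\vec{Z}$, we have $Y_{\vec{X} \gets \vec{x}, \vec{W} \gets \vec{w}, \vec{Z}' \gets \vec{z}^*}(\vec{u}) = y$.” It is important to write at least the
$\vec{Z}'$ explicitly in the subscript here, since we are quantifying over it.

Note that, due to setting $\vec{W}$ to $\vec{w}$, the values of the variables in $\vec{Z}$ may change. AC2(b$^o$)
says that this change does not affect $\varphi$; $\varphi$ continues to be true. Indeed, $\varphi$ continues to be true
even if some variables in $\vec{Z}$ are forced to their original values.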

Before going on, I need to make a brief technical digression to explain a slight abuse of
notation in AC2(b$^o$). Suppose that $\vec{Z} = (Z_1, Z_2)$, $\vec{z} = (1, 0)$, and $\vec{Z}' = (Z_1)$. Then $\vec{Z}' \gets \vec{z}$
is intended to be an abbreviation for $Z_1 \gets 1$; that is, I am ignoring the second component of
$\vec{z}$ here. More generally, when I write $\vec{Z}' \gets \vec{z}$, I am picking out the values in $\vec{z}$ that correspond
to the variables in $\vec{Z}'$ and ignoring those that correspond to the variables in $\vec{Z} - \vec{Z}'$. I similarly
write $\vec{W}' \gets \vec{w}$ if $\vec{W}'$ is a subset of $\vec{W}$.

For reasons that will become clearer in Example 2.8.1, the original definition was updated
to use a stronger version of AC2(b$^o$). Sufficiency is now required to hold if the variables in
any subset $\vec{W}'$ of $\vec{W}$ are set to the values in $\vec{w}$ (as well as the variables in any subset $\vec{Z}'$ of
$\vec{Z} - \vec{X}$ being set to their values in the actual context). Taking $\vec{Z}$ and $\vec{W}$ as in AC2(a) (and
AC2(b$^o$)), here is the updated definition:


AC2(b$^u$). If $\vec{z}^*$ is such that $(M, \vec{u}) \models \vec{Z} = \vec{z}^*$, then for all subsets $\vec{W}'$ of $\vec{W}$ and subsets $\vec{Z}'$
of $\vec{Z} - \vec{X}$, we have $(M, \vec{u}) \models [\vec{X} \gets \vec{x}, \vec{W}' \gets \vec{w}, \vec{Z}' \gets \vec{z}^*]\varphi$.


In statisticians’ notation, this becomes “if $\vec{Z}(\vec{u}) = \vec{z}^*$, then for all subsets $\vec{Z}'$ of $\vec{Z}$ and $\vec{W}'$
of $\vec{W}$, we have $Y_{\vec{X} \gets \vec{x}, \vec{W}' \gets \vec{w}, \vec{Z}' \gets \vec{z}^*}(\vec{u}) = y$.” Again, we need to write the variables in the
subscript here.

The only difference between AC2(b$^o$) and AC2(b$^u$) lies in the clause “for all subsets $\vec{W}'$
of $\vec{W}$”: AC2(b$^u$) must hold even if only a subset $\vec{W}'$ of the variables in $\vec{W}$ are set to their
values in $\vec{w}$. This means that the variables in $\vec{W} - \vec{W}'$ essentially act as they do in the real
world; that is, their values are determined by the structural equations, rather than being set to
their values in $\vec{w}$.

The superscripts $o$ and $u$ in AC2(b$^o$) and AC2(b$^u$) are intended to denote “original” and
“updated”.
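Given a candidate witness, the two sufficiency conditions can be checked by enumerating the relevant subsets. The following is a minimal sketch under stated assumptions: the names `ac2b`, `subsets`, `solve`, and `M_d` are hypothetical, the model is the disjunctive forest-fire model described earlier, and the flag `updated` switches between AC2(b$^o$) (only $\vec{W}' = \vec{W}$) and AC2(b$^u$) (every subset $\vec{W}'$ of $\vec{W}$).

```python
from itertools import combinations

M_d = {
    "L":  lambda u, v: u[0],
    "MD": lambda u, v: u[1],
    "FF": lambda u, v: max(v["L"], v["MD"]),
}

def solve(model, context, do=None):
    """Solve the model after the intervention `do` (a dict of forced values)."""
    do = do or {}
    values = {}
    for var, equation in model.items():
        values[var] = do[var] if var in do else equation(context, values)
    return values

def subsets(items):
    """All subsets of `items`, as tuples."""
    items = list(items)
    for size in range(len(items) + 1):
        yield from combinations(items, size)

def ac2b(model, context, X, x, W, w, phi, updated):
    """Check AC2(b^o) (updated=False) or AC2(b^u) (updated=True)."""
    actual = solve(model, context)                       # supplies z*
    Z_minus_X = [var for var in model if var not in W and var not in X]
    W_choices = subsets(W) if updated else [tuple(W)]
    for W_sub in W_choices:
        for Z_sub in subsets(Z_minus_X):
            do = dict(zip(X, x))                         # X held at its actual value
            do.update({var: w[W.index(var)] for var in W_sub})
            do.update({var: actual[var] for var in Z_sub})
            if not phi(solve(model, context, do)):
                return False
    return True

def fire(values):
    return values["FF"] == 1

u = (1, 1)
print(ac2b(M_d, u, ("L",), (1,), ("MD",), (0,), fire, updated=False))   # True
print(ac2b(M_d, u, ("L",), (1,), ("MD",), (0,), fire, updated=True))    # True
print(ac2b(M_d, u, ("L",), (1,), ("FF",), (0,), fire, updated=False))   # False
```

The last call uses the degenerate contingency that forces $FF$ to 0; it passes AC2(a) but fails sufficiency, since $\varphi$ cannot hold while $FF$ is held at 0.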

I conclude by considering the simpler modified definition. This definition is motivated by 
the intuition that only what happens in the actual situation should matter. Thus, the only 
settings of variables allowed are ones that occur in the actual situation. Specifically, the 
modified definition simplifies AC2(a) by requiring that the only setting $\vec{w}$ of the variables
in $\vec{W}$ that can be considered is the value of these variables in the actual context. Here is the
modified AC2(a), denoted AC2(a$^m$) (the $m$ stands for “modified”):


AC2(a$^m$). There is a set $\vec{W}$ of variables in $\mathcal{V}$ and a setting $\vec{x}'$ of the variables in $\vec{X}$ such that
if $(M, \vec{u}) \models \vec{W} = \vec{w}^*$, then


$(M, \vec{u}) \models [\vec{X} \gets \vec{x}', \vec{W} \gets \vec{w}^*]\neg\varphi$.


That is, we can show the counterfactual dependence of $\varphi$ on $\vec{X}$ by keeping the variables in $\vec{W}$
fixed at their actual values. In statisticians’ notation (with $\varphi$ being $Y = y$), this becomes “if
$\vec{W}(\vec{u}) = \vec{w}^*$, then $Y_{\vec{x}'\vec{w}^*}(\vec{u}) \ne y$”.

It is easy to see that AC2(b$^o$) holds if all the variables in $\vec{W}$ are fixed at their values in the
actual context: because $\vec{w}^*$ records the value of the variables in $\vec{W}$ in the actual context, if
$\vec{X}$ is (re)set to $\vec{x}$, its value in the actual context, then $\varphi$ must hold (since, by AC1, $\vec{X} = \vec{x}$ and
$\varphi$ both hold in the actual context); that is, we have $(M, \vec{u}) \models [\vec{X} \gets \vec{x}, \vec{W} \gets \vec{w}^*]\varphi$ if $\vec{w}^*$ is
the value of the variables in $\vec{W}$ in the actual context. For similar reasons, AC2(b$^u$) holds if
all the variables in $\vec{W}$ are fixed at their values in the actual context. Thus, there is no need
for an analogue to AC2(b$^o$) or AC2(b$^u$) in the modified definition. This shows that the need
for a sufficiency condition arises only if we are considering contingencies that differ from the
actual setting in AC2(a). It is also worth noting that the modified definition does not need to
mention $\vec{Z}$ (although $\vec{Z}$ can be taken to be the complement of $\vec{W}$).
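AC2(a$^m$) is also easy to check by enumeration on small models, because the only freedom is the choice of $\vec{W}$ and of $\vec{x}'$. The sketch below makes the same hypothetical encoding assumptions as before (`M_d`, `RANGES`, `solve`, `subsets`, and `ac2am_witnesses` are illustrative names, and the model is the disjunctive forest-fire model described earlier).

```python
from itertools import combinations, product

M_d = {
    "L":  lambda u, v: u[0],
    "MD": lambda u, v: u[1],
    "FF": lambda u, v: max(v["L"], v["MD"]),
}
RANGES = {"L": (0, 1), "MD": (0, 1), "FF": (0, 1)}

def solve(model, context, do=None):
    """Solve the model after the intervention `do` (a dict of forced values)."""
    do = do or {}
    values = {}
    for var, equation in model.items():
        values[var] = do[var] if var in do else equation(context, values)
    return values

def subsets(items):
    items = list(items)
    for size in range(len(items) + 1):
        yield from combinations(items, size)

def ac2am_witnesses(model, ranges, context, X, phi):
    """Yield (W, x') such that phi fails after [X <- x', W <- w*],
    where w* holds the variables in W at their actual values."""
    actual = solve(model, context)
    others = [var for var in model if var not in X]
    for W in subsets(others):
        for x_new in product(*(ranges[var] for var in X)):
            do = {**dict(zip(X, x_new)), **{var: actual[var] for var in W}}
            if not phi(solve(model, context, do)):
                yield W, x_new

def fire(values):
    return values["FF"] == 1

u = (1, 1)
print(list(ac2am_witnesses(M_d, RANGES, u, ("L",), fire)))        # []
print(list(ac2am_witnesses(M_d, RANGES, u, ("L", "MD"), fire)))   # [((), (0, 0))]
```

In context $(1,1)$ no witness exists for $L = 1$ alone, because $MD$ can only be held at its actual value 1; the conjunction $L = 1 \land MD = 1$ has the witness $\vec{W} = \emptyset$, $\vec{x}' = (0, 0)$.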

For future reference, for all variants of the HP definition, the tuple $(\vec{W}, \vec{w}, \vec{x}')$ in AC2 is
said to be a witness to the fact that $\vec{X} = \vec{x}$ is a cause of $\varphi$. (I take the witness to be $(\emptyset, \emptyset, \vec{x}')$
in the special case that $\vec{W} = \emptyset$.)

As I said, AC3 is a minimality condition. Technically, just as there are three versions of 
AC2, there are three corresponding versions of AC3. For example, in the case of the modified 
definition, AC3 should really say “there is no strict subset of $\vec{X}$ satisfying AC1 and AC2(a$^m$)”. I
will not bother writing out these versions of AC3; I hope that the intent is clear whenever I 
refer to AC3. 

If I need to specify which variant of the HP definition I am considering, I will say that
$\vec{X} = \vec{x}$ is a cause of $\varphi$ according to the original (resp., updated, modified) HP definition. If
$\vec{X} = \vec{x}$ is a cause of $\varphi$ according to all three variants, I often just say “$\vec{X} = \vec{x}$ is a cause of
$\varphi$ in $(M, \vec{u})$”. Each conjunct in $\vec{X} = \vec{x}$ is called part of a cause of $\varphi$ in context $(M, \vec{u})$. As
we shall see, what we think of as causes in natural language correspond to parts of causes,
especially with the modified HP definition. Indeed, it may be better to use a term such as
“complete cause” for what I have been calling cause and then reserve “cause” for what I have
called “part of a cause”. I return to this point after looking at a few examples.

Actually, although $\vec{X} = \vec{x}$ is an abbreviation for the conjunction $X_1 = x_1 \land \ldots \land X_k = x_k$,
when we talk about $\vec{X} = \vec{x}$ being a cause of $\varphi$ (particularly for the modified HP definition), it
might in some ways be better to think of the disjunction. Roughly speaking, $\vec{X} = \vec{x}$ is a cause
of $\varphi$ if it is the case that if none of the variables in $\vec{X}$ had their actual value, then $\varphi$ might not
have occurred. However, if $\vec{X}$ has more than one variable, then just changing the value of a
single variable in $\vec{X}$ (or, indeed, any strict subset of the variables in $\vec{X}$) is not enough by itself
to bring about $\neg\varphi$ (given the context). Thus, there is a sense in which the disjunction can be
viewed as a but-for cause of $\varphi$. This will be clearer when we see the examples.

Note that all three variants of the HP definition declare $X = x$ a cause of itself in $(M, \vec{u})$
as long as $(M, \vec{u}) \models X = x$. This does not seem to me so unreasonable. At any rate, it
seems fairly harmless, so I do not bother excluding it (although nothing would change if we
did exclude it).

At this point, ideally, I would prove a theorem showing that some variant of the HP def- 
inition of actual causality is the “right” definition of actual causality. But I know of no way 
to argue convincingly that a definition is the “right” one; the best we can hope to do is to 
show that it is useful. As a first step, I show that all the definitions agree in the simplest and 
arguably most common case: but-for causes. Formally, say that $X = x$ is a but-for cause of
$\varphi$ in $(M, \vec{u})$ if AC1 holds (so that $(M, \vec{u}) \models (X = x) \land \varphi$) and there exists some $x'$ such that
$(M, \vec{u}) \models [X \gets x']\neg\varphi$. Note here I am assuming that the cause is a single conjunct.
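The but-for test is the simplest of the conditions to state in code. As a hedged sketch (the encoding and the names `is_but_for`, `M_c`, `M_d`, `RANGES`, and `solve` are assumptions for illustration; the two models are the conjunctive and disjunctive forest-fire models described earlier), the check is AC1 plus a search over alternative values $x'$ of the single variable $X$:

```python
def make_forest_fire(combine):
    """Forest-fire model; `combine` is min (conjunctive) or max (disjunctive)."""
    return {
        "L":  lambda u, v: u[0],
        "MD": lambda u, v: u[1],
        "FF": lambda u, v: combine(v["L"], v["MD"]),
    }

M_c = make_forest_fire(min)     # fire needs both lightning and the match
M_d = make_forest_fire(max)     # fire needs either one
RANGES = {"L": (0, 1), "MD": (0, 1), "FF": (0, 1)}

def solve(model, context, do=None):
    """Solve the model after the intervention `do` (a dict of forced values)."""
    do = do or {}
    values = {}
    for var, equation in model.items():
        values[var] = do[var] if var in do else equation(context, values)
    return values

def is_but_for(model, ranges, context, X, x, phi):
    """X = x is a but-for cause of phi: AC1 holds and some x' falsifies phi."""
    actual = solve(model, context)
    if actual[X] != x or not phi(actual):                # AC1
        return False
    return any(not phi(solve(model, context, {X: x_new}))
               for x_new in ranges[X])

def fire(values):
    return values["FF"] == 1

print(is_but_for(M_c, RANGES, (1, 1), "L", 1, fire))     # True
print(is_but_for(M_d, RANGES, (1, 1), "L", 1, fire))     # False
```

In the disjunctive model the test fails for $L = 1$ because the dropped match keeps the fire burning, which is exactly the overdetermination that the contingencies in AC2 are meant to handle.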


Proposition 2.2.2 If $X = x$ is a but-for cause of $Y = y$ in $(M, \vec{u})$, then $X = x$ is a cause of
$Y = y$ according to all three variants of the HP definition.


Proof: Suppose that $X = x$ is a but-for cause of $Y = y$, and write $\varphi$ for $Y = y$. There must be a
possible value $x'$ of $X$ such that $(M, \vec{u}) \models [X \gets x']\neg\varphi$. Then $(\emptyset, \emptyset, x')$ (i.e., $\vec{W} = \emptyset$ and $X = x'$) is a
witness to $X = x$ being a cause of $\varphi$ for all three variants of the definition. Thus, AC2(a)
and AC2(a$^m$) hold if we take $\vec{W} = \emptyset$. Since $(M, \vec{u}) \models X = x$, if $(M, \vec{u}) \models \vec{Z} = \vec{z}^*$, where
$\vec{Z} = \mathcal{V} - \{X\}$, then it is easy to see that $(M, \vec{u}) \models [X \gets x](\vec{Z} = \vec{z}^*)$: since $X = x$ and
$\vec{Z} = \vec{z}^*$ in the (unique) solution to the equations in context $\vec{u}$, this is still the case in the unique
solution to the equations in $M_{X \gets x}$. For similar reasons, $(M, \vec{u}) \models [X \gets x, \vec{Z}' \gets \vec{z}^*]\varphi$ for
all subsets $\vec{Z}'$ of $\mathcal{V} - \{X\}$. (See Lemma 2.10.2, where these statements are formalized.) Thus,
AC2(b$^o$) holds. Because $\vec{W} = \emptyset$, AC2(b$^u$) follows immediately from AC2(b$^o$). ■


Of course, the definitions do not always agree. Other connections between the definitions 
are given in the following theorem. In the statement of the theorem, the notation |X| denotes 
the cardinality of $\vec{X}$.

Theorem 2.2.3


(a) If $X = x$ is part of a cause of $\varphi$ in $(M, \vec{u})$ according to the modified HP definition, then
$X = x$ is a cause of $\varphi$ in $(M, \vec{u})$ according to the original HP definition.


(b) If $X = x$ is part of a cause of $\varphi$ in $(M, \vec{u})$ according to the modified HP definition, then
$X = x$ is a cause of $\varphi$ in $(M, \vec{u})$ according to the updated HP definition.


(c) If $X = x$ is part of a cause of $\varphi$ in $(M, \vec{u})$ according to the updated HP definition, then
$X = x$ is a cause of $\varphi$ in $(M, \vec{u})$ according to the original HP definition.


(d) If $\vec{X} = \vec{x}$ is a cause of $\varphi$ in $(M, \vec{u})$ according to the original HP definition, then $|\vec{X}| = 1$
(i.e., $\vec{X}$ is a singleton).


Of course, (a) follows from (b) and (c); I state it separately just to emphasize the relations.
Although it may not look like it at first glance, part (d) is quite similar in spirit to parts (a),
(b), and (c); an equivalent reformulation of (d) is: “If $X = x$ is part of a cause of $\varphi$ in $(M, \vec{u})$
according to the original HP definition, then $X = x$ is a cause of $\varphi$ in $(M, \vec{u})$ according to
the original HP definition.” Call this statement (d′). Clearly (d) implies (d′). For the converse,
suppose that (d′) holds, $\vec{X} = \vec{x}$ is a cause of $\varphi$ in $(M, \vec{u})$, and $|\vec{X}| > 1$. Then if $X = x$ is a
conjunct of $\vec{X} = \vec{x}$, by (d′), $X = x$ is a cause of $\varphi$ in $(M, \vec{u})$. But by AC3, this can hold only
if $X = x$ is the only conjunct of $\vec{X} = \vec{x}$, so $|\vec{X}| = 1$.

Parts (a), (b), and (c) of Theorem 2.2.3 show that, in a sense, the original HP definition is 
the most permissive of the three and the modified HP definition is the most restrictive, with 
the updated HP definition lying somewhere in between. The converses of (a), (b), and (c) 
do not hold. As Example 2.8.1 shows, if $X = x$ is a cause of $\varphi$ according to the original
HP definition, it is not, in general, part of a cause according to the modified or updated HP 
definition. In this example, we actually do not want causality to hold, so the example can 
be viewed as showing that the original definition is too permissive (although this claim is 
not quite as clear as it may appear at first; see the discussion in Section 2.8). Example 2.8.2 
shows that part of a cause according to the updated HP definition need not be (part of) a cause 
according to the modified HP definition. 

The bottom line here is that, although these definitions often agree, they do not always 
agree. It is typically on the most problematic examples that they disagree. This is
perhaps not surprising given that the original definition was updated and modified precisely 
to handle such examples. However, even in cases where the original definition seems to 
give “wrong” answers, there are often ways of dealing with the problem that do not require 
modifying the definition. Moreover, the fact that causes are always singletons with the original 
definition makes it attractive in some ways. The jury is still out on what the “right” definition 
of causality is. Although my current preference is the modified HP definition, I will consider 
all three definitions throughout the book. Despite the extra burden for the reader (for which I 
apologize in advance!), I think doing so gives a deeper insight into the subtleties of causality. 

The proof of Theorem 2.2.3 can be found in Section 2.10.1. I recommend that the interested
reader defer reading the proof until after going through the examples in the next section, since
they provide more intuition for the definitions.


2.3. Examples


Because I cannot prove a theorem showing that (some variant of) the HP definition is the 
“right” definition of causality, all I can do to argue that the definition is reasonable is to 
show how it works in some of the examples that have proved difficult for other definitions to 
handle. That is one of the goals of this section. I also give examples that illustrate the subtle 
differences between the variants of the HP definition. I start with a few basic examples mainly 
intended to show how the definition works. 


Example 2.3.1 For the forest-fire example, I consider two causal models, $M^c$ and $M^d$, for
the conjunctive and disjunctive cases, respectively. These models were described earlier; the
endogenous variables are $L$, $MD$, and $FF$, and $U$ is the only exogenous variable. In both
cases, I want to consider the context $(1,1)$, so the lightning strikes and the arsonist drops the
match. In the conjunctive model, both the lightning and the dropped match are but-for causes
of the forest fire; if either one had not occurred, the fire would not have happened. Hence, by
Proposition 2.2.2, both $L = 1$ and $MD = 1$ are causes of $FF = 1$ in $(M^c, (1,1))$ according
to all three variants of the definition. By AC3, it follows that $L = 1 \land MD = 1$ is not a cause
of $FF = 1$ in $(M^c, (1,1))$. This already shows that causes might not be unique; there may be
more than one cause of a given outcome.

However, in the disjunctive case, there are differences. With the original and updated
definition, again, we have that both $L = 1$ and $MD = 1$ are causes of $FF = 1$ in $(M^d, (1,1))$.
I give the argument in the case of $L = 1$; the argument for $MD = 1$ is identical. Clearly
$(M^d, (1,1)) \models FF = 1$ and $(M^d, (1,1)) \models L = 1$; in the context $(1,1)$, the lightning
strikes and the forest burns down. Thus, AC1 is satisfied. AC3 is trivially satisfied: since $\vec{X}$
consists of only one element, $L$, $\vec{X}$ must be minimal.

For AC2, as suggested earlier, let $\vec{Z} = \{L, FF\}$, $\vec{W} = \{MD\}$, $x' = 0$, and $w = 0$.
Clearly, $(M^d, (1,1)) \models [L \gets 0, MD \gets 0](FF \ne 1)$; if the lightning does not strike and the
match is not dropped, the forest does not burn down, so AC2(a) is satisfied. To see the effect
of the lightning, we must consider the contingency where the match is not dropped; AC2(a)
allows us to do that by setting $MD$ to 0. (Note that setting $L$ and $MD$ to 0 overrides the effects
of $U$; this is critical.) Moreover,


$(M^d, (1,1)) \models [L \gets 1, MD \gets 0](FF = 1)$ and $(M^d, (1,1)) \models [L \gets 1](FF = 1)$;


in context $(1,1)$, if the lightning strikes, then the forest burns down even if the lit match is
not dropped, so AC2(b$^o$) and AC2(b$^u$) are satisfied. (Note that since $\vec{Z} = \{L, FF\}$, the only
subsets of $\vec{Z} - \vec{X}$ are the empty set and the singleton set consisting of just $FF$; similarly, since
$\vec{W} = \{MD\}$, the only subsets of $\vec{W}$ are the empty set and the singleton set consisting of $MD$
itself. Thus, we have considered all the relevant cases here.)

This argument fails in the case of the modified definition because the setting $MD = 0$ is
not allowed. The only witnesses that can be considered are ones where $\vec{W}$ has the value it does
in the actual context; in the actual context here, $MD = 1$. So, with the modified definition,
neither $L = 1$ nor $MD = 1$ is a cause. However, it is not hard to see that $L = 1 \land MD = 1$
is a cause of $FF = 1$. This shows why it is critical to consider only single conjuncts in
Proposition 2.2.2 and why Theorem 2.2.3 is worded in terms of parts of causes. Although
$L = 1 \land MD = 1$ is a cause of $FF = 1$ according to the modified HP definition, it is not a cause
according to either the original or updated HP definition. In contrast, $L = 1$ and $MD = 1$ are
both parts of a cause of $FF = 1$ according to all three definitions.

It is arguably a feature of the original and updated HP definition that they call both $L = 1$
and $MD = 1$ causes of $FF = 1$. Calling the conjunction $L = 1 \land MD = 1$ a cause of
$FF = 1$ does not seem to accord with natural language usage. There are two ways to address
this concern. As I suggested earlier, with the modified HP definition, it may be better to think
of parts of causes as coming closer to what we call causes in natural language. That would
already deal with this concern. A second approach, which I also hinted at earlier, is to observe
that it may be better to think of the disjunction $L = 1 \lor MD = 1$ as being the cause. Indeed,
we can think of $MD = 1 \lor L = 1$ as a but-for cause of $FF = 1$; if it does not hold, then it
must be the case that both $L = 0$ and $MD = 0$, so there is no fire. The reader should keep in mind
both of these ways of thinking of conjunctive causes with the modified HP definition.

As we shall see, the notion of responsibility, discussed in Chapter 6, allows the original 
and updated HP definition to distinguish these two scenarios, as does the notion of sufficient 
cause, discussed in Section 2.6. ■


This simple example already reveals some of the power of the HP definition. The case 
where the forest fire is caused by either the lightning or the dropped match (this is said to be a 
case of overdetermination in the literature) cannot be handled by the simple but-for definition 
used in the law. It seems reasonable to call both the lightning and the dropped match causes, 
or at least parts of causes. We certainly wouldn’t want to say that there is no cause, as a 
naive application of the but-for definition would do. (However, as I noted above, if we allow 
disjunctive causes, then a case can be made that the disjunction L = 1 ∨ MD = 1 is a but-for
cause of the forest fire.) Overdetermination occurs frequently in legal cases; a victim may 
be shot by two people, for example. It occurs in voting as well; I look at this a little more 
carefully in the following example. 


Example 2.3.2 Consider the voting scenario discussed earlier where there are 11 voters. If 
Suzy wins 6-5, then all the definitions agree that each of the voters for Suzy is a cause of
Suzy’s victory; indeed, they are all but-for causes. But suppose that Suzy wins 11-0. Here 
we have overdetermination. The original and updated HP definition would still call each of 
the voters a cause of Suzy winning; the witness involves switching the votes of 5 voters. 
Since the modified HP definition does not allow such switching, according to the modified 
HP definition, any subset of six voters is a cause of Suzy winning (and every individual is part 
of a cause). Again, if we think of the subset as being represented by a disjunction, it can be 
thought of as a but-for cause of Suzy winning. If all six voters had switched their votes to 
Billy, then Suzy would not have won. Minimality holds; if a set of fewer than six voters had 
voted for Billy, then Suzy would still have won. ∎
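
The counting behind this example can be sketched as follows, assuming that each vote is set directly by the context (so switching some votes leaves the others at their actual values) and that Suzy needs a strict majority. The function names are illustrative only.

```python
from itertools import combinations

def suzy_wins(votes):
    """votes[i] is 1 for Suzy and 0 for Billy; Suzy needs a strict majority."""
    return 2 * sum(votes) > len(votes)

def minimal_switch_sets(votes):
    """Minimal sets of Suzy voters whose switch to Billy makes Suzy lose."""
    supporters = [i for i, v in enumerate(votes) if v == 1]
    found = []
    for size in range(1, len(supporters) + 1):
        for group in combinations(supporters, size):
            if any(set(prev) <= set(group) for prev in found):
                continue  # a smaller set already suffices, so this one is not minimal
            switched = [0 if i in group else v for i, v in enumerate(votes)]
            if not suzy_wins(switched):
                found.append(group)
    return found

close = minimal_switch_sets([1] * 6 + [0] * 5)   # Suzy wins 6-5
landslide = minimal_switch_sets([1] * 11)        # Suzy wins 11-0
print(len(close), {len(g) for g in close})          # 6 {1}
print(len(landslide), {len(g) for g in landslide})  # 462 {6}
```

In the 6-5 election every Suzy voter is a singleton set, that is, a but-for cause. In the 11-0 election the minimal sets are exactly the 462 subsets of six voters, and every voter belongs to some of them.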


We do have an intuition that a voter for Suzy is somehow “less” of a cause of the victory 
in an 11-0 victory than in a 6-5 victory. This intuition is perhaps most compatible with the 
modified HP definition. In this case, X = x is a cause of φ if X is a minimal set of variables
whose values have to change for ¬φ to hold. The bigger X is, the less of a cause each of its
conjuncts is. I return to this issue when I discuss responsibility in Chapter 6.


Example 2.3.3 Now I (finally!) consider the rock-throwing example, where Suzy and Billy
both throw rocks at a bottle, but Suzy’s hits the bottle, and Billy’s doesn’t (although it would 
have hit had Suzy’s not hit first). We get the desired result—that Suzy’s throw is a cause, but 
Billy’s is not—but only if we model the story appropriately. Consider first a coarse causal 
model, with three endogenous variables: 


• ST for “Suzy throws”, with values 0 (Suzy does not throw) and 1 (she does);
• BT for “Billy throws”, with values 0 (he doesn’t) and 1 (he does);
• BS for “bottle shatters”, with values 0 (it doesn’t shatter) and 1 (it does).


For simplicity, assume that there is one exogenous variable u, which determines whether Billy 
and Suzy throw. (In most of the subsequent examples, I omit the exogenous variable in both 
the description of the story and the corresponding causal network.) Take the formula for BS 
to be such that the bottle shatters if either Billy or Suzy throws; that is, BS = BT ∨ ST. (I
am implicitly assuming that Suzy and Billy never miss if they throw. Also, I follow the fairly
standard mathematics convention of saying “if” rather than “if and only if” in definitions;
clearly, the equation is saying that the bottle shatters if and only if either Billy or Suzy throws.)
For future reference, I call this model M_RT (where RT stands for “rock throwing”).

BT and ST play symmetric roles in M_RT; there is nothing to distinguish them. Not surprisingly,
both Billy’s throw and Suzy’s throw are classified as causes of the bottle shattering
in M_RT in the context u where Suzy and Billy both throw according to the original and updated
HP definition (the conjunction ST = 1 ∧ BT = 1 is the cause according to the modified
HP definition). The argument is essentially identical to that used for the disjunctive model of
the forest-fire example, where either the lightning or the dropped match is enough to start the
fire. Indeed, the causal network describing this situation looks like that in Figure 2.1, with ST
and BT replacing L and MD. For convenience, I redraw the network here. As I said earlier,
here and in later figures, I typically omit the exogenous variable(s).


[Figure 2.2: M_RT, a naive model for the rock-throwing example. Edges: ST → BS and BT → BS.]


The trouble with M_RT is that it cannot distinguish the case where both rocks hit the bottle
simultaneously (in which case it would be reasonable to say that both ST = 1 and BT = 1
are causes of BS = 1) from the case where Suzy’s rock hits first. M_RT has to be refined
to express this distinction. One way is to invoke a dynamic model. Although this can be
done (see the notes at the end of the chapter for more discussion), a simpler way to gain
the expressiveness needed here is to allow BS to be three-valued, with values 0 (the bottle
doesn’t shatter), 1 (it shatters as a result of being hit by Suzy’s rock), and 2 (it shatters as a
result of being hit by Billy’s rock). I leave it to the reader to check that ST = 1 is a cause
of BS = 1, but BT = 1 is not (if Suzy doesn’t throw but Billy does, then we would have
BS = 2). To some extent, this solves our problem. But it borders on cheating; the answer is
almost programmed into the model by invoking the relation “as a result of” in the definition
of BS = 2, which requires the identification of the actual cause.

A more useful choice is to add two new variables to the model: 


• BH for “Billy’s rock hits the (intact) bottle”, with values 0 (it doesn’t) and 1 (it does);
and

• SH for “Suzy’s rock hits the bottle”, again with values 0 and 1.

We now modify the equations as follows:

• BS = 1 if SH = 1 or BH = 1;
• SH = 1 if ST = 1;
• BH = 1 if BT = 1 and SH = 0.


Thus, Billy’s throw hits if Billy throws and Suzy’s rock doesn’t hit. The last equation implicitly
assumes that Suzy throws slightly ahead of Billy, or slightly harder.

Call this model M'_RT. M'_RT is described by the graph in Figure 2.3 (where again the
exogenous variables are omitted). The asymmetry between BH and SH (in particular, the
fact that Billy’s throw doesn’t hit the bottle if Suzy throws) is modeled by the fact that there
is an edge from SH to BH but not one in the other direction; BH depends (in part) on SH
but not vice versa.


[Figure 2.3: M'_RT, a better model for the rock-throwing example. Edges: ST → SH, BT → BH, SH → BH, SH → BS, BH → BS.]


Taking u to be the context where Billy and Suzy both throw, according to all three variants
of the HP definition, ST = 1 is a cause of BS = 1 in (M'_RT, u), but BT = 1 is not. To
see that ST = 1 is a cause according to the original and updated HP definition, note that it
is immediate that AC1 and AC3 hold. To see that AC2 holds, one possibility is to choose
Z = {ST, SH, BH, BS}, W = {BT}, and w = 0. When BT is set to 0, BS tracks ST:
if Suzy throws, the bottle shatters, and if she doesn’t throw, the bottle does not shatter. It
immediately follows that AC2(a) and AC2(b^o) hold. AC2(b^u) is equivalent to AC2(b^o) in this
case, since W is a singleton.

To see that BT = 1 is not a cause of BS = 1 in (M'_RT, u), we must check that there is no
partition of the endogenous variables into sets Z and W that satisfies AC2. Attempting the
symmetric choice with Z = {BT, BH, SH, BS}, W = {ST}, and w = 0 violates AC2(b^o)
and AC2(b^u). To see this, take Z' = {BH}. In the context where Suzy and Billy both throw,
BH = 0. If BH is set to 0, then the bottle does not shatter if Billy throws and Suzy does
not. It is precisely because, in this context, Suzy’s throw hits the bottle and Billy’s does not
that we declare Suzy’s throw to be the cause of the bottle shattering. AC2(b^o) and AC2(b^u)
capture that intuition by forcing us to consider the contingency where BH = 0 (i.e., where
BH takes on its actual value), despite the fact that Billy throws.
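
The two witness checks just described can be sketched in code. This is only an illustration under stated assumptions: the equations are those given for the refined model (SH = ST, BH = 1 iff BT = 1 and SH = 0, BS = 1 iff SH = 1 or BH = 1), the context is the one where both throw, and the helper names `solve`, `ac2a`, and `ac2b_violations` are hypothetical. It tests one particular choice of W for each candidate, not every partition.

```python
from itertools import combinations

def solve(do=None):
    """Evaluate the refined model when both throw; `do` forces variables to values."""
    do = do or {}
    v = {}
    v["ST"] = do.get("ST", 1)
    v["BT"] = do.get("BT", 1)
    v["SH"] = do.get("SH", v["ST"])
    v["BH"] = do.get("BH", int(v["BT"] == 1 and v["SH"] == 0))
    v["BS"] = do.get("BS", int(v["SH"] == 1 or v["BH"] == 1))
    return v

ACTUAL = solve()  # ST = BT = SH = BS = 1 and BH = 0

def ac2a(x, w, w_val=0):
    """With W = {w} set to w_val, switching x off must prevent the shattering."""
    return solve({x: 0, w: w_val})["BS"] == 0

def ac2b_violations(x, w, w_val=0):
    """Subsets of the remaining variables that, frozen at actual values, give BS = 0."""
    rest = [name for name in ACTUAL if name not in (x, w)]
    bad = []
    for size in range(len(rest) + 1):
        for subset in combinations(rest, size):
            do = {x: 1, w: w_val}
            do.update({z: ACTUAL[z] for z in subset})
            if solve(do)["BS"] != 1:
                bad.append(subset)
    return bad

print(ac2a("ST", "BT"), ac2b_violations("ST", "BT"))  # True []
print(ac2a("BT", "ST"), ac2b_violations("BT", "ST"))  # True [('BH',)]
```

For Suzy's throw both conditions hold. For Billy's throw the first condition holds, but freezing BH at its actual value 0 breaks the second.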

Of course, just checking that this particular partition into Z and W does not make BT = 1
a cause does not suffice to establish that BT = 1 is not a cause according to the original or
updated definitions; we must check all possible partitions. I just sketch the argument here:
The key is to consider whether BH is in W or Z. If BH is in W, then how we set BT has
no effect on the value of BS: If BH is set to 0, then the value of BS is determined by the
value of ST; and if BH is set to 1, then BS = 1 no matter what BT is. (It may seem strange
to intervene and set BH to 1 if BT = 0. It means that somehow Billy hits the bottle even if
he doesn’t throw. Such “miraculous” interventions are allowed by the definition; part of the
motivation for the notion of normality considered in Chapter 3 is to minimize their usage.)
This shows that we cannot have BH in W. If BH is in Z, then we get the same problem with
AC2(b^o) or AC2(b^u) as above; it is easy to see that at least one of SH or ST must be in W,
and w must be such that whichever is in W is set to 0.

Note that, in this argument, it is critical that in AC2(b^o) and AC2(b^u) we allow setting an
arbitrary subset of variables in Z − X to their original values. There is a trivial reason for this:
if we set all variables in Z − X to their original values in M'_RT, then, among other things, we
will set BS to 1. We will never be able to show that Billy’s throw is not a cause if BS is set
to 1. But even requiring that all the variables in Z − {BS, BT} be set to their original values
does not work. For suppose that we take W = {ST} and take w such that ST = 0. Clearly
AC2(a) holds because if BT = 0, then BS = 0. But if all the variables in Z − {BT, BS} are
set to their original values, then SH is set to 1, so the bottle shatters. To show that BT = 1
is not a cause, we must be able to set BH to its original value of 0 while keeping SH at 0.
Setting BH to 0 captures the intuition that Billy’s throw is not a cause because, in the actual
world, his rock did not hit the bottle (BH = 0).

Finally, consider the modified HP definition. Here things are much simpler. In this case,
taking W = {BT} does not work to show that ST = 1 is a cause of BS = 1; we are
not allowed to change BT from 1 to 0. However, we can take W = {BH}. Fixing BH
at 0 (its setting in the actual context), if ST is set to 0, then BS = 0, that is, (M'_RT, u) ⊨
[ST ← 0, BH ← 0](BS = 0). Thus, ST = 1 is a cause of BS = 1 according to the modified
HP definition. (Taking W = {BH} would also have worked to show that ST = 1 is a cause
of BS = 1 according to the original and updated HP definition. Indeed, by Theorem 2.2.3,
showing that ST = 1 is a cause of BS = 1 according to the modified definition suffices to
show that it is also a cause according to the original and updated definition. I went through
the argument for the original and updated definition because the more obvious choice for W
in that case is {BT}, and it works under those definitions.) But now it is also easy to check
that BT = 1 is not a cause of BS = 1 according to the modified HP definition. No matter
which subset of variables other than BT are held to their values in u and no matter how we
set BT, we have SH = 1 and thus BS = 1.
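
For a model this small, the modified definition can be checked by brute force. The sketch below assumes binary variables and the equations of the refined rock-throwing model; the helpers `solve`, `ac2_modified`, and `is_cause_modified` are illustrative names, not part of the text. It tries every set W held at its actual values and every setting of the candidate variables, and then tests minimality.

```python
from itertools import combinations, product

def solve(eqs, do=None):
    """Evaluate equations in their listed (causal) order; `do` forces variables."""
    do = do or {}
    v = {}
    for name, f in eqs.items():
        v[name] = do[name] if name in do else f(v)
    return v

def ac2_modified(eqs, X, phi):
    """Is there a W, frozen at actual values, and a setting of X that falsifies phi?"""
    actual = solve(eqs)
    others = [n for n in eqs if n not in X]
    for size in range(len(others) + 1):
        for W in combinations(others, size):
            for xs in product((0, 1), repeat=len(X)):
                do = {w: actual[w] for w in W}
                do.update(zip(X, xs))
                if not phi(solve(eqs, do)):
                    return True
    return False

def is_cause_modified(eqs, X, phi):
    """AC1 (phi actually holds), AC2(a^m), and AC3 (no proper subset of X works)."""
    if not phi(solve(eqs)) or not ac2_modified(eqs, X, phi):
        return False
    return not any(ac2_modified(eqs, list(S), phi)
                   for size in range(1, len(X)) for S in combinations(X, size))

# Refined rock-throwing model in the context where both throw.
RT = {
    "ST": lambda v: 1,
    "BT": lambda v: 1,
    "SH": lambda v: v["ST"],
    "BH": lambda v: int(v["BT"] == 1 and v["SH"] == 0),
    "BS": lambda v: int(v["SH"] == 1 or v["BH"] == 1),
}
shatters = lambda v: v["BS"] == 1

print(is_cause_modified(RT, ["ST"], shatters))  # True
print(is_cause_modified(RT, ["BT"], shatters))  # False
```

The exhaustive search is exponential in the number of variables, so it is suitable only for toy models like this one.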

I did not completely specify M'_RT in the example above. Specifically, I did not specify
the set of possible contexts. What happens in other contexts is irrelevant for determining
whether ST = 1 or BT = 1 is a cause of BS = 1 in (M'_RT, u). But once I take normality
considerations into account in Chapter 3, it will become relevant. I described u above as the
context where Billy and Suzy both throw, implicitly suggesting that there are four contexts,
determining the four possible combinations of Billy throwing and Suzy throwing. So we can
take M'_RT to be the model with these four contexts.

There is arguably a problem with this definition of M'_RT: it “bakes in” the temporal ordering
of events, in particular, that Suzy’s rock hits before Billy’s rock. If we want a model that
describes “rock-throwing situations for Billy and Suzy” more generally, we would not want
to necessarily assume that Billy and Suzy are accurate, nor that Suzy always hits before Billy
if they both throw. Whatever assumptions we make along these lines can be captured by the
context. That is, we can have the context determine (a) whether Suzy and Billy throw, (b) how
accurate they are (would Billy hit if he were the only one to throw, and similarly for Suzy),
and (c) whose rock hits first if they both throw and are accurate (allowing for the possibility
that they both hit at the same time). This richer model would thus have 48 contexts: there are
four choices of which subset of Billy and Suzy throw, four possibilities for accuracy (both are
accurate, Suzy is accurate and Billy is not, and so on), and three choices regarding who hits
first if they both throw. If we use this richer model, we would also want to expand the language
to include variables such as SA (Suzy is accurate), BA (Billy is accurate), SF (Suzy’s
throw arrives first if both Billy and Suzy throw), and BF (Billy’s throw arrives first). Call this
richer model M*_RT. In M*_RT, the values of ST, BT, SA, BA, SF, and BF are determined
by the context. As with M_RT and M'_RT, the bottle shatters if either Suzy or Billy hit it, so
we have the equation BS = BH ∨ SH. But now the equations for BH and SH depend on
ST, BT, SA, BA, SF, and BF. Of course SH = 0 if Suzy doesn’t throw (i.e., if ST = 0)
or is inaccurate (SA = 0), but even in this case, the equations need to tell us what would have
happened had Suzy thrown and been accurate, so that we can determine the truth of formulas
such as [ST ← 1, SA ← 1](SH = 0). As suggested by the discussion above, the equation
for SH is


SH = 1 if ST = 1, SA = 1, and either BT = 0, BA = 0, or SF = 1;
SH = 0 otherwise.


In words, Suzy hits the bottle if she throws and is accurate, and either (a) Billy doesn’t throw, 
(b) Billy is inaccurate, or (c) Billy throws and is accurate but Suzy’s rock arrives at the bottle 
first (or at the same time as Billy’s); otherwise, Suzy doesn’t hit the bottle. 
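
As a small illustration, the equation for SH and the count of 48 contexts can be written out directly. This is a sketch under assumptions: the labels `suzy_first`, `billy_first`, and `tie` for the arrival order are hypothetical, and a tie is taken to give SF = 1, following the parenthetical remark above. The equation for BH is not spelled out in the text, so it is omitted here.

```python
from itertools import product

def sh(st, sa, bt, ba, sf):
    """Suzy hits iff she throws accurately and Billy does not get there strictly first."""
    return int(st == 1 and sa == 1 and (bt == 0 or ba == 0 or sf == 1))

throws = list(product((0, 1), repeat=2))      # (ST, BT): who throws
accuracy = list(product((0, 1), repeat=2))    # (SA, BA): who is accurate
arrival = ["suzy_first", "billy_first", "tie"]
contexts = list(product(throws, accuracy, arrival))
print(len(contexts))  # 48

def sf_of(order):
    """Assumed encoding: SF = 1 when Suzy's rock arrives first or at the same time."""
    return int(order in ("suzy_first", "tie"))

# The context where both throw, both are accurate, and Suzy's rock arrives first.
print(sh(1, 1, 1, 1, sf_of("suzy_first")))   # 1
# Same throws and accuracy, but Billy's rock arrives first.
print(sh(1, 1, 1, 1, sf_of("billy_first")))  # 0
```

Because `sh` takes its arguments explicitly, counterfactual queries such as [ST ← 1, SA ← 1] amount to calling it with those two arguments set to 1 and the rest read from the context.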

The context u in M'_RT corresponds to the context u* in M*_RT where Billy and Suzy both
throw, both are accurate, and Suzy’s throw hits first. Exactly the same argument as that above
shows that ST = 1 is a cause of BS = 1 in (M*_RT, u*), and BT = 1 is not. In the model M*_RT, the
direction of the edge in the causal network between SH and BH depends on the context. In
some contexts (the ones where both Suzy and Billy are accurate and Suzy would hit first if
they both throw), BH depends on SH; in other contexts, SH depends on BH. As a result,
M*_RT is not what I called a strongly recursive model. M*_RT is still a recursive model because
in each context u' in M*_RT, there are no cyclic dependencies; there is still an ordering ≼_{u'} on
the endogenous variables such that unless X ≼_{u'} Y, Y is independent of X in (M*_RT, u'). In
particular, although SH ≼_{u*} BH and there is a context u'' in M*_RT such that BH ≼_{u''} SH,
M*_RT is still considered recursive.

It is critical in this analysis of the rock-throwing example that the model includes the 
variables BH and SH, that is, that Billy’s rock hitting the bottle is taken to be a different 
event than Suzy’s rock hitting the bottle (although there is no problem if it includes extra 
variables). To understand the need for BH and SH (or some analogous variables), consider 
an observer watching the situation. Why would she declare Suzy’s throw to be a cause and 
Billy’s throw not to be a cause? Presumably, precisely because it was Suzy’s rock that hit the 
bottle and not Billy’s. If this is the reason for the declaration, then it must be modeled. Of 
course, using the variables BH and SH is not the only way of doing this. All that is needed 
is that the language be rich enough to allow us to distinguish the situation where Suzy’s rock 
hits and Billy’s rock does not from the one where Billy’s rock hits and Suzy’s rock does 
not. (That is why I said “or some analogous variables” above.) An alternative approach to 
incorporating temporal information is to have time-indexed variables (e.g., to have a family 
of variables BS_k for “bottle shatters at time k” and a family H_k for “bottle is hit at time k”).
With these variables and the appropriate equations, we can dispense with SH and BH and
just have the variables H_1, H_2, ... that talk about the bottle being hit at time k, without the
variable specifying who hit the bottle. For example, if we assume that all the action happens
at times 1, 2, and 3, then the equations are H_1 = ST (if Suzy throws, then the bottle is hit at
time 1), H_2 = BT ∧ ¬H_1 (the bottle is hit at time 2 if Billy throws and the bottle was not
already hit at time 1), and BS = H_1 ∨ H_2 (if the bottle is hit, then it shatters). Again, these
equations model the fact that Suzy hits first if she throws. And again, we could assume that
the equations for H_1 and H_2 are context-dependent, so that Suzy hits first in u, but Billy hits
first in u’. In any case, in context u where Suzy and Billy both throw but Suzy’s rock arrives 
at the bottle first, Suzy is still the cause, and Billy isn’t, using essentially the same analysis as 
that above. 

To summarize the key point here, the language (i.e., the variables chosen and the equations 
for them) must be rich enough to capture the significant features of the story, but there is more 
than one language that can do this. As we shall see in Chapter 4, the choice of language can 
in general have a major impact on causality judgments. ∎


The rock-throwing example emphasizes an important moral. If we want to argue in a case
of preemption that X = x rather than Y = y is the cause of φ, then there must be a variable
(BH in this case) that takes on different values depending on whether X = x or Y = y is
the actual cause. If the model does not contain such a variable, then it will not be possible to
determine which one is in fact the cause. The need for such variables is certainly consistent
with intuition and the way we present evidence. If we want to argue (say, in a court of law)
that it was A’s shot that killed C and not B’s, then, assuming that A shot from C’s left and
B from the right, we would present evidence such as the bullet entering C from the left side.
The side from which the shot entered is the relevant variable in this case. The variable may
involve temporal evidence (if B’s shot had been the lethal one, then the death would have
occurred a few seconds later), but it certainly does not have to.

The rock-throwing example also emphasizes the critical role of what happens in the actual 
situation in the definition. Specifically, we make use of the fact that, in the actual context,
Billy’s rock did not hit the bottle. In the original and updated HP definition, this fact is applied
in AC2(b): to show that Billy is a cause, we would have to show that the bottle shatters when
Billy throws, even if BH is set to its actual value of 0. In the modified HP definition, it is
applied in AC2(a^m) to show that Suzy’s throw is a cause: we can set ST = 0 while keeping
BH fixed at 0. The analogous argument would not work to show that Billy’s throw is a cause.

Although what happens in the actual context certainly affects people’s causality judgments, 
why is the way that this is captured in AC2(a), AC2(b^o), or AC2(b^u) the “right” way of capturing
it? I do not have a compelling answer to this question, beyond showing that these
definitions work well in examples. I have tried to choose examples that, although not always
realistic, capture important features of causality. For example, while the rock-throwing example
may not seem so important in and of itself, it is an abstraction of a situation that arises
frequently in legal cases, where one potential cause is preempted by another. Monopolistic
practices by a big company cause a small company to go bankrupt, but it would have gone
bankrupt anyway because of poor management; a smoker dies in a car accident, but he would
have died soon due to inoperable lung cancer had there not been an accident. A good definition
of causality is critical for teasing out what is a cause and what is not in such cases.
Many of the other examples in this section are also intended to capture the essence of other 
important issues that arise in legal (and everyday) reasoning. It is thus worth understanding 
in all these cases exactly what the role of the actual context is. 


Example 2.3.4 This example considers the problem of what has been called double prevention.


Suzy and Billy have grown up just in time to get involved in World War III. Suzy 
is piloting a bomber on a mission to blow up an enemy target, and Billy is piloting 
a fighter as her lone escort. Along comes an enemy fighter plane, piloted by 
Enemy. Sharp-eyed Billy spots Enemy, zooms in, and pulls the trigger; Enemy’s 
plane goes down in flames. Suzy’s mission is undisturbed, and the bombing takes 
place as planned. 


Is Billy a cause of the success of the mission? After all, he prevented Enemy from preventing
Suzy from carrying out the mission. Intuitively, it seems that the answer is yes, and the
obvious causal model gives us this. Suppose that we have the following variables: 


• BPT for “Billy pulls trigger”, with values 0 (he doesn’t) and 1 (he does);
• EE for “Enemy eludes Billy”, with values 0 (he doesn’t) and 1 (he does);
• ESS for “Enemy shoots Suzy”, with values 0 (he doesn’t) and 1 (he does);
• SBT for “Suzy bombs target”, with values 0 (she doesn’t) and 1 (she does);
• TD for “target destroyed”, with values 0 (it isn’t) and 1 (it is).

The causal network corresponding to this model is given in Figure 2.4. 


In this model, BPT = 1 is a but-for cause of TD = 1, as is SBT = 1, so both BPT = 1
and SBT = 1 are causes of TD = 1 according to all the variants of the HP definition.

[Figure 2.4: Blowing up the target. Causal network: BPT → EE → ESS → SBT → TD.]


We can make the story more complicated by adding a second fighter plane escorting Suzy,
piloted by Hillary. Billy still shoots down Enemy, but if he hadn’t, Hillary would have. The
natural way of dealing with this is to add just one more variable HPT representing Hillary’s
pulling the trigger iff EE = 1 (see Figure 2.5), but then, using the naive but-for criterion,
one might conclude that the target will be destroyed (TD = 1) regardless of Billy’s action, so
BPT = 1 would not be a cause of TD = 1. All three variants of the HP definition still declare
BPT = 1 to be a cause of TD = 1, as expected. We now take W = {HPT} and fix HPT
at its value in the actual context, namely, 0. Although Billy’s action seems superfluous under
ideal conditions, it becomes essential under a contingency where Hillary for some reason fails
to pull the trigger. This contingency is represented by fixing HPT at 0 irrespective of EE.
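
The role of the contingency can be made concrete with a short sketch. The text does not spell out the equations, so the ones below are assumptions chosen to match the story: Enemy eludes Billy iff Billy does not fire, Hillary fires iff Enemy eludes Billy, Enemy shoots Suzy iff he eludes Billy and Hillary does not fire, Suzy bombs the target iff she is not shot, and the target is destroyed iff she bombs it. The function name `solve` is illustrative.

```python
def solve(do=None):
    """Evaluate the assumed double-prevention model; `do` forces variables to values."""
    do = do or {}
    v = {}
    v["BPT"] = do.get("BPT", 1)             # context: Billy pulls the trigger
    v["EE"] = do.get("EE", 1 - v["BPT"])    # Enemy eludes Billy iff Billy does not fire
    v["HPT"] = do.get("HPT", v["EE"])       # Hillary fires iff Enemy eludes Billy
    v["ESS"] = do.get("ESS", int(v["EE"] == 1 and v["HPT"] == 0))
    v["SBT"] = do.get("SBT", 1 - v["ESS"])  # Suzy bombs iff she is not shot
    v["TD"] = do.get("TD", v["SBT"])
    return v

actual = solve()
print(actual["HPT"], actual["TD"])         # 0 1
# Naive but-for test: Hillary steps in, so the target is destroyed anyway.
print(solve({"BPT": 0})["TD"])             # 1
# Hold HPT at its actual value 0: now Billy's shot makes the difference.
print(solve({"BPT": 0, "HPT": 0})["TD"])   # 0
```

Holding HPT at 0 is permitted under all three variants because 0 is the value HPT takes in the actual context.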


[Figure 2.5: Blowing up the target, with Billy and Hillary. Causal network: BPT → EE → ESS → SBT → TD, with EE → HPT and HPT → ESS.]


So far, this may all seem reasonable. But now go back to the original story and suppose 
that, again, Billy goes up along with Suzy, but Enemy does not actually show up. Is Billy 
going up a cause of the target being destroyed? Clearly, if Enemy had shown up, then Billy 
would have shot him down, and he would have been a but-for cause of the target being destroyed.
But what about the context where Enemy does not show up? It seems that the original
and updated HP definition would say that, even in that context, Billy going up is a cause of 
the target being destroyed. For if we consider the contingency where Enemy had shown up, 
then the target would not have been destroyed if Billy weren’t there, but since he is, the target 
is destroyed. 

This may seem disconcerting. Suppose it is known that just about all the enemy’s aircraft 
have been destroyed, and no one has been coming up for days. Is Billy going up still a cause? 
The concern is that, if so, almost any A can become a cause of any B by telling a story of how 
there might have been a C that, but for A, would have prevented B from happening (just as 
there might have been an Enemy that prevented the target being destroyed had Billy not been 
there). 

Does the HP definition really declare Billy showing up a cause of the target being destroyed 
in this case? Let’s look a little more carefully at how we model this story. If we want to call
Billy going up a cause, then we need to add to the model a variable BGU corresponding to
Billy going up. Similarly, we should add a variable ESU corresponding to Enemy showing
up. We have the obvious equations, which say that (in the contexts of interest) BPT = 1
(i.e., Billy pulls the trigger) only if Billy goes up and Enemy shows up, and EE = 0 only
if either Enemy doesn’t show up or Billy pulls the trigger. This gives us the model shown in
Figure 2.6.

[Figure 2.6: Blowing up the target, where Enemy may not show up. Edges implied by the equations: BGU → BPT, ESU → BPT, ESU → EE, BPT → EE, EE → ESS → SBT → TD.]


In the actual world, where Billy goes up and Enemy doesn’t, Billy doesn’t pull the trigger 
(why should he?), so BPT = 0. Now it is easy to see that Billy going up (i.e., BGU = 1) is 
not a cause of the target being destroyed if Enemy doesn’t show up. AC2(b^o) fails (and hence
so does AC2(b^u)): if Enemy shows up, the target won’t be destroyed, even if Billy shows up,
because in the actual world BPT = 0. The modified HP definition addresses the problem
even more straightforwardly. No matter which variables we fix to their actual values, TD = 1
even if BGU = 0. ∎
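
Both halves of this argument can be checked on a small model. The sketch below reads the "only if" conditions above as "if and only if" and fills in the remaining equations in the obvious way (ESS = EE, SBT = 1 − ESS, TD = SBT); these are assumptions, as are the helper names.

```python
from itertools import combinations

EQS = {
    "BGU": lambda v: 1,                                   # context: Billy goes up
    "ESU": lambda v: 0,                                   # context: Enemy does not show up
    "BPT": lambda v: int(v["BGU"] == 1 and v["ESU"] == 1),
    "EE":  lambda v: int(v["ESU"] == 1 and v["BPT"] == 0),
    "ESS": lambda v: v["EE"],
    "SBT": lambda v: 1 - v["ESS"],
    "TD":  lambda v: v["SBT"],
}

def solve(do=None):
    """Evaluate the equations in causal order; `do` forces variables to values."""
    do = do or {}
    v = {}
    for name, f in EQS.items():
        v[name] = do[name] if name in do else f(v)
    return v

ACTUAL = solve()

def bgu_is_modified_cause():
    """Modified HP: freeze any W at actual values, vary BGU, look for TD = 0."""
    others = [n for n in EQS if n != "BGU"]
    for size in range(len(others) + 1):
        for W in combinations(others, size):
            for bgu in (0, 1):
                do = {w: ACTUAL[w] for w in W}
                do["BGU"] = bgu
                if solve(do)["TD"] == 0:
                    return True
    return False

print(bgu_is_modified_cause())                         # False
# Original/updated HP, contingency W = {ESU} with ESU = 1:
print(solve({"BGU": 0, "ESU": 1})["TD"])               # 0, so AC2(a) holds
print(solve({"BGU": 1, "ESU": 1, "BPT": 0})["TD"])     # 0, so AC2(b^o) fails
```

The last line is the decisive one: once BPT is held at its actual value 0, Billy going up no longer protects the target in the contingency where Enemy shows up.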


Cases of prevention and double prevention are quite standard in the real world. It is standard
to install fire alarms, which, of course, are intended to prevent fires. When the batteries
in a fire alarm become weak, fire alarms typically “chirp”. Deactivating the chirping sound
could be a cause of a fire not being detected due to double prevention: deactivating the chirping
sound prevents the fire alarm from sounding, which in turn prevents the fire from being
detected.


Example 2.3.5 Can not performing an action be (part of) a cause? Consider the following 
story. 


Billy, having stayed out in the cold too long throwing rocks, contracts a serious 
but nonfatal disease. He is hospitalized and treated on Monday, so is fine Tuesday 
morning. 


But now suppose that the doctor does not treat Billy on Monday. Is the doctor’s not treating 
Billy a cause of Billy’s being sick on Tuesday? It seems that it should be, and indeed it is 
according to all variants of the HP definition. Suppose that u is the context where, among
other things, Billy is sick on Monday and the doctor forgets to administer the medication
Monday. It seems reasonable that the model should have two variables:

• MT for “Monday treatment”, with values 0 (the doctor does not treat Billy on Monday)
and 1 (he does); and

• BMC for “Billy’s medical condition”, with values 0 (recovered) and 1 (still sick).


Sure enough, in the obvious causal setting, MT = 0 is a but-for cause of BMC = 1, and 
hence a cause according to all three variants of the HP definition. 

This may seem somewhat disconcerting at first. Suppose there are 100 doctors in the 
hospital. Although only one of them was assigned to Billy (and he forgot to give medication), 
in principle, any of the other 99 doctors could have given Billy his medication. Is the fact that 
they didn’t give him the medication also part of the cause of him still being sick on Tuesday? 

In the causal model above, the other doctors’ failure to give Billy his medication is not a 
cause, since the model has no variables to model the other doctors’ actions, just as there was 
no variable in the causal model of Example 2.3.1 to model the presence of oxygen. Their lack 
of action is part of the context. We factor it out because (quite reasonably) we want to focus 
on the actions of Billy’s doctor. 

If we had included endogenous variables corresponding to the other doctors, then they too 
would be causes of Billy’s being sick on Tuesday. The more refined definition of causality
given in Chapter 3, which takes normality into account, provides a way of avoiding this
problem even if the model includes endogenous variables for the other doctors. ∎


Causation by omission is a major issue in the law. To take just one of many examples, a 
surgeon can be sued for the harm caused due to a surgical sponge that he did not remove after 
an operation. 

The next example again emphasizes how the choice of model can change what counts as a 
cause. 


Example 2.3.6 The engineer is standing by a switch in the railroad tracks. A train approaches 
in the distance. She flips the switch, so that the train travels down the right track instead of the 
left. Because the tracks reconverge up ahead, the train arrives at its destination all the same. 

If we model this story using three variables, namely F for “flip”, with values 0 (the engineer
doesn’t flip the switch) and 1 (she does); T for “track”, with values 0 (the train goes on the
left track) and 1 (it goes on the right track); and A for “arrival”, with values 0 (the train does
not arrive at the point of reconvergence) and 1 (it does), then all three definitions agree that
flipping the switch is not a cause of the train arriving.

But now suppose that we replace T with two binary variables, LB (for left track blocked),
which is 0 if the left track is not blocked, and 1 if it is, and RB (for right track blocked),
defined symmetrically. Suppose that LB, RB, and F are determined by the context, while
the value of A is determined in the obvious way by which track the train is going down
(which is determined by how the switch is set) and whether the track that it is going down
is blocked; specifically, A = (F ∧ ¬RB) ∨ (¬F ∧ ¬LB). In the actual context, F = 1 and
LB = RB = 0. Under the original and updated HP definition, F = 1 is a cause of A = 1.
For in the contingency where LB = 1, if F = 1, then the train arrives, whereas if F = 0,
then the train does not arrive. While adding the variables LB and RB suggests that we care
about whether a track is blocked, it seems strange to call flipping the switch a cause of the
train arriving when in fact both tracks are unblocked. This problem can be dealt with to some
extent by invoking normality considerations, but not completely (see Example 3.4.3). With
the modified definition, the problem disappears. Flipping the switch is not a cause of the train
arriving if both tracks are unblocked, nor is it a cause of the train not arriving if both tracks
are blocked. ∎
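
The difference between the definitions in this example comes down to which settings of LB and RB may be considered. A minimal sketch, assuming that F = 1 sends the train down the right track (as in the story), so that arrival requires the right track to be unblocked when F = 1 and the left track to be unblocked when F = 0; the function name `arrives` is illustrative.

```python
def arrives(f, lb, rb):
    """The train arrives iff the track selected by the switch is unblocked."""
    on_right = (f == 1)
    return int((on_right and rb == 0) or (not on_right and lb == 0))

# Actual context: switch flipped, both tracks clear.
print(arrives(1, 0, 0))  # 1
# Modified HP: LB and RB may only be held at their actual values (0, 0).
print(arrives(0, 0, 0))  # 1, so flipping makes no difference
# Original/updated HP: the contingency LB = 1 may be considered.
print(arrives(1, 1, 0), arrives(0, 1, 0))  # 1 0, so flipping is decisive there
# Both tracks blocked: the train fails to arrive whatever the switch does.
print(arrives(1, 1, 1), arrives(0, 1, 1))  # 0 0
```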


Example 2.3.7 Suppose that a captain and a sergeant stand before a private, both shout 
“Charge!” at the same time, and the private charges. Some have argued that, because orders 
from higher-ranking officers trump those of lower-ranking officers, the captain is a cause of 
the charge, whereas the sergeant is not. 

It turns out that which of the sergeant or captain is a cause depends in part on the variant 
of the HP definition we consider and in part on what the possible actions of the sergeant and 
captain are. First, suppose that the sergeant and the captain can each either order an advance, 
order a retreat, or do nothing. Formally, let C and S represent the captain’s and sergeant’s 
order, respectively. Then C and S can each take three values, 1, -1, or 0, depending on the
order (attack, retreat, no order). The actual value of C and S is determined by the context. P
describes what the private does; as the story suggests, P = C if $C \neq 0$; otherwise P = S. In
the actual context, C = S = P = 1.

C = 1 is a but-for cause of P = 1; setting C to -1 would result in P = -1. Thus, C = 1
is a cause of P = 1 according to all variants of the HP definition. S = 1 is not a cause of
P = 1 according to the modified HP definition; changing S to 0 while keeping C fixed at
1 will not affect the private’s action. However, S = 1 is a cause of P = 1 according to the
original and updated HP definition, with witness ({C}, 0, 0) (i.e., in the witness world, C = 0
and S = 0): if the captain does nothing, then what the private does is completely determined
by the sergeant’s action.

Next, suppose that the captain and sergeant can only order an attack or a retreat; that is,
the range of each of C and S is {-1, 1}. Now it is easy to check that C = 1 is the only
cause of P = 1 according to all the variants of the HP definition. For the original and updated
definition, S = 1 is not a cause because there is no setting of C that will allow the sergeant’s
action to make a difference.

Finally, suppose that the captain and sergeant can only order an attack or do nothing; that is, 
the range of each of C and S is {0,1}. C = 1 and S = 1 are both causes of P = 1 according 
to the original and updated HP definition, using the same argument used to show that S = 1 
was a cause when it was also possible to order a retreat. Neither is a cause according to the 
modified HP definition, but $C = 1 \land S = 1$ is a cause, so both C = 1 and S = 1 are parts of
a cause. 
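
The three cases above can be checked by brute force. The following Python sketch is only an illustration: the helper names (`private`, `flips`, `modified_hp_causes`, `cause_with_contingency`) are hypothetical, and the contingency test is a simplification of the original and updated definitions that is adequate here only because the model contains just the two officers and the private.

```python
from itertools import combinations, product

ACTUAL = {"C": 1, "S": 1}          # both officers order an attack

def private(c, s):
    """Equation for P: obey the captain unless he gives no order (C = 0)."""
    return c if c != 0 else s

def flips(changed, rng):
    """Can resetting the variables in `changed` (the others keep their
    actual values) make P differ from its actual value?"""
    p_actual = private(ACTUAL["C"], ACTUAL["S"])
    for values in product(rng, repeat=len(changed)):
        world = {**ACTUAL, **dict(zip(changed, values))}
        if private(world["C"], world["S"]) != p_actual:
            return True
    return False

def modified_hp_causes(rng):
    """Minimal sets of variables that pass the `flips` test."""
    causes = []
    for size in (1, 2):
        for subset in combinations(("C", "S"), size):
            minimal = not any(set(c) < set(subset) for c in causes)
            if minimal and flips(subset, rng):
                causes.append(subset)
    return causes

def cause_with_contingency(var, rng):
    """Simplified original/updated-style test: is there a value w for the
    other officer such that P stays 1 while `var` keeps its actual value,
    but P changes for some other value of `var`?"""
    other = "S" if var == "C" else "C"
    for w in rng:
        keep = {**ACTUAL, other: w}
        if private(keep["C"], keep["S"]) != 1:
            continue               # the contingency itself destroys P = 1
        for v in rng:
            alt = {**keep, var: v}
            if private(alt["C"], alt["S"]) != 1:
                return True
    return False

for rng in [(-1, 0, 1), (-1, 1), (0, 1)]:
    print(rng, modified_hp_causes(rng),
          {v: cause_with_contingency(v, rng) for v in ("C", "S")})
```

Under these assumptions, the sketch reports C alone as the modified-definition cause for the ranges {-1, 0, 1} and {-1, 1}, and the conjunction of C and S for the range {0, 1}; the contingency test rules out S only when the range is {-1, 1}.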

Note that if the range of the variables C and S is {0, 1}, then there is no way to capture the
fact that in the case of conflicting orders, the private obeys the captain. In this setting, we can
capture the fact that the private is “really” obeying the captain if both the captain and sergeant
give orders by adding a new variable SE that captures the sergeant’s “effective” order. If the
captain does not issue any orders (i.e., if C = 0), then SE = S. If the captain does issue
an order, then SE = 0; the sergeant’s order is effectively blocked. In this model, P = C if
$C \neq 0$; otherwise, P = SE. The causal network for this model is given in Figure 2.7.

Figure 2.7: A model of commands that captures trumping.


In this model, the captain causes the private to advance, but the sergeant does not, according
to all variants of the HP definition. To see that the captain is a cause according to the modified
HP definition, hold SE fixed at 0 and set C = 0; clearly P = 0. By Theorem 2.2.3, C = 1 is
a cause of P = 1 according to the original and updated definition as well. To show that S = 1
is not (part of) a cause according to all variants of the HP definition, it suffices to show it is
not a cause according to the original HP definition. Suppose that we want to argue that S = 1
causes P = 1. The obvious thing to do is to take W = {C} and Z = {S, SE, P}. However,
this choice does not satisfy AC2(b$^o$): if C = 0, SE = 0 (its original value), and S = 1, then
P = 0, not 1. The key point is that this more refined model allows a setting where C = 0,
S = 1, and P = 0 (because SE = 0). That is, despite the sergeant issuing an order to attack
and the captain being silent, the private does nothing. ∎


The final example in this section touches on issues of responsibility in the law. 


Example 2.3.8 Suppose that two companies both dump pollutant into the river. Company A 
dumps 100 kilograms of pollutant; company B dumps 60 kilograms. The fish in the river die. 
Biologists determine that k kilograms of pollutant suffice for the fish to die. Which company 
is the cause of the fish dying if k = 120, if k = 80, and if k = 50?

It is easy to see that if k = 120, then both companies are causes of the fish dying, according
to all three definitions (each company is a but-for cause of the outcome). If k = 50, then each
company is still a cause according to the original and updated HP definition. For example, to 
see that company B is a cause, we consider the contingency where company A does not dump 
any pollutant. Then the fish die if company B pollutes, but they survive if B does not pollute. 
With the modified definition, neither company individually is a cause; there is no variable that 
we can hold at its actual value that would make company A or company B a but-for cause. 
However, both companies together are the cause. 

The situation gets more interesting if k = 80. Now the modified definition says that only 
A is a cause; if A dumps 100 kilograms of pollutant, then what B does has no impact. The original
and updated definition also agree that A is a cause if k = 80. Whether B is a cause depends 
on the possible amounts of pollutant that A can dump. If A can dump only 0 or 100 kilograms 
of pollutant, then B is not a cause; no setting of A’s action can result in B’s action making a 
difference. However, if A can dump some amount between 20 and 79 kilograms, then B is a 
cause. 
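
The threshold cases can be checked with a few lines of Python. This is a sketch under stated assumptions rather than part of the formal treatment: the fish are taken to die exactly when the total amount dumped is at least k, the alternative to dumping is dumping nothing, and the function names are hypothetical.

```python
def fish_die(a, b, k):
    """Assumed equation: the fish die iff at least k kilograms are dumped."""
    return a + b >= k

def but_for(k, a=100, b=60):
    """Would the fish have survived had one company dumped nothing?"""
    return {"A": not fish_die(0, b, k), "B": not fish_die(a, 0, k)}

def b_is_cause_with_contingency(k, a_range, b=60):
    """Original/updated-style test for B: is there an amount a' that A could
    dump such that the fish die if B dumps 60 but survive if B dumps 0?"""
    return any(fish_die(a, b, k) and not fish_die(a, 0, k) for a in a_range)

for k in (120, 80, 50):
    print(k, but_for(k))

# For k = 80, B's status depends on which amounts A is able to dump.
print(b_is_cause_with_contingency(80, (0, 100)))        # False
print(b_is_cause_with_contingency(80, range(0, 101)))   # True

# The amounts for A that make B's action decisive when k = 80.
witnesses = [a for a in range(0, 101)
             if fish_die(a, 60, 80) and not fish_die(a, 0, 80)]
print(witnesses[0], witnesses[-1])                      # 20 79
```

With whole-kilogram amounts, the witnesses for k = 80 are exactly the amounts from 20 to 79, matching the range given above.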

It’s not clear what the “right” answer should be here if k = 80. The law typically wants to 
declare B a contributing cause to the death of the fish (in addition to A), but should this depend 
on the amount of pollutant that A can dump? As we shall see in Section 6.2, thinking in 
terms of responsibility and blame helps clarify the issue (see Example 6.2.5). Under minimal 
assumptions about how likely various amounts of pollutant are to be dumped, B will get some 
degree of blame according to the modified definition, even when it is not a cause. ∎


2.4 Transitivity


Example 2.4.1 Now consider the following modification of Example 2.3.5. 


Suppose that Monday’s doctor is reliable and administers the medicine first thing 
in the morning, so that Billy is fully recovered by Tuesday afternoon. Tuesday’s 
doctor is also reliable and would have treated Billy if Monday’s doctor had failed 
to. ... And let us add a twist: one dose of medication is harmless, but two doses 
are lethal. 


Is the fact that Tuesday’s doctor did not treat Billy the cause of him being alive (and recovered) 
on Wednesday morning? 
The causal model $M_B$ for this story is straightforward. There are three variables:


• MT for Monday’s treatment (1 if Billy was treated Monday; 0 otherwise);
• TT for Tuesday’s treatment (1 if Billy was treated Tuesday; 0 otherwise); and
• BMC for Billy’s medical condition (0 if Billy feels fine both Tuesday morning and
  Wednesday morning; 1 if Billy feels sick Tuesday morning, fine Wednesday morning;
  2 if Billy feels sick both Tuesday and Wednesday morning; 3 if Billy feels fine Tuesday
  morning and is dead Wednesday morning).


We can then describe Billy’s condition as a function of the four possible combinations of 
treatment/nontreatment on Monday and Tuesday. I omit the obvious structural equations cor- 
responding to this discussion; the causal network is shown in Figure 2.8. 


Figure 2.8: Billy’s medical condition (causal network over MT, TT, and BMC).


In the context where Billy is sick and Monday’s doctor treats him, MT = 1 is a but-for
cause of BMC = 0; if Billy had not been treated on Monday, then Billy would not have felt
fine on Tuesday. MT = 1 is also a but-for cause of TT = 0, and TT = 0 is a but-for cause
of Billy’s being alive ($BMC \neq 3$, or equivalently, $BMC = 0 \lor BMC = 1 \lor BMC = 2$).
However, MT = 1 is not part of a cause of Billy’s being alive, according to any variant of the
HP definition. It suffices to show that it is not a cause according to the original HP definition.
This follows because setting MT = 0 cannot result in Billy dying, no matter how we fix TT.

This shows that causality is not transitive according to the HP definition. Although MT = 1
is a cause of TT = 0 and TT = 0 is a cause of $BMC = 0 \lor BMC = 1 \lor BMC = 2$,
MT = 1 is not a cause of $BMC = 0 \lor BMC = 1 \lor BMC = 2$. Nor is causality closed
under right weakening; that is, replacing a conclusion by something it implies. If A is a cause
of B and B logically implies B′, then A may not be a cause of B′. In this case, MT = 1 is a
cause of BMC = 0, which logically implies $BMC = 0 \lor BMC = 1 \lor BMC = 2$, which is
not caused by MT = 1. ∎
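
The failure of transitivity in this example can be verified directly. The sketch below fills in the omitted structural equations in one natural way, which is an assumption rather than something stated in the text: TT = 1 - MT (Tuesday’s doctor treats exactly if Monday’s did not), and BMC is read off the four treatment combinations. The names `BMC_TABLE`, `solve`, and `alive` are hypothetical.

```python
# BMC as a function of (MT, TT), following the four cases in the story.
BMC_TABLE = {(1, 0): 0,   # treated Monday only: fine Tuesday and Wednesday
             (0, 1): 1,   # treated Tuesday only: sick Tuesday, fine Wednesday
             (0, 0): 2,   # never treated: sick both mornings
             (1, 1): 3}   # treated twice: dead Wednesday morning

def solve(mt, tt=None):
    """Solve the model with MT set to `mt`; TT follows its (assumed)
    equation TT = 1 - MT unless an intervention fixes it."""
    if tt is None:
        tt = 1 - mt
    return {"MT": mt, "TT": tt, "BMC": BMC_TABLE[(mt, tt)]}

def alive(world):
    return world["BMC"] != 3

actual = solve(mt=1)                                  # MT = 1, TT = 0, BMC = 0

mt_causes_bmc0 = solve(mt=0)["BMC"] != actual["BMC"]  # but-for: BMC becomes 1
mt_causes_tt0 = solve(mt=0)["TT"] != actual["TT"]     # but-for: TT becomes 1
tt_causes_alive = not alive(solve(mt=1, tt=1))        # but-for: two doses kill
mt_causes_alive = not alive(solve(mt=0))              # not but-for
# Even with TT held fixed at either value, MT = 0 never kills Billy.
mt_can_kill = any(not alive(solve(mt=0, tt=tt)) for tt in (0, 1))

print(mt_causes_bmc0, mt_causes_tt0, tt_causes_alive,
      mt_causes_alive, mt_can_kill)
# True True True False False
```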


Although this example may seem somewhat forced, there are many quite realistic exam- 
ples of lack of transitivity with exactly the same structure. Consider the body’s homeostatic 
system. An increase in external temperature causes a short-term increase in core body tem- 
perature, which in turn causes the homeostatic system to kick in and return the body to normal 
core body temperature shortly thereafter. But if we say that the increase in external tempera- 
ture happened at time 0 and the return to normal core body temperature happened at time 1, 
we certainly would not want to say that the increase in external temperature at time 0 caused 
the body temperature to be normal at time 1! 

There is another reason that causality is intransitive, which is illustrated by the following 
example. 


Example 2.4.2 Suppose that a dog bites Jim’s right hand. Jim was planning to detonate a 
bomb, which he normally would do by pressing the button with his right forefinger. Because 
of the dog bite, he presses the button with his left forefinger. The bomb still goes off. 

Consider the causal model with variables DB (the dog bites, with values 0 and 1), P (the 
press of the button, with values 0, 1, and 2, depending on whether the button is not pressed 
at all, pressed with the right hand, or pressed with the left hand), and B (the bomb goes off). 
We have the obvious equations: DB is determined by the context, P = DB +1, and B = 1 
if P is either 1 or 2. In the context where DB = 1, it is clear that DB = 1 is a but-for cause 
of P = 2 (if the dog had not bitten, P would have been 1), and P = 2 is a but-for cause of 
B = 1 (if P were 0, then B would be 0), but DB = 1 is not a cause of B = 1. Regardless
of whether the dog had bitten Jim, the button would have been pressed, and the bomb would
have detonated. ∎
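
The same kind of check works for the dog-bite model, using the equations just given (P = DB + 1, and B = 1 exactly if P is 1 or 2). The function name `solve` is a hypothetical helper introduced only for this sketch.

```python
def solve(db, p=None):
    """P follows P = DB + 1 unless an intervention sets it;
    B = 1 iff the button is pressed with either hand."""
    if p is None:
        p = db + 1
    return {"DB": db, "P": p, "B": 1 if p in (1, 2) else 0}

actual = solve(db=1)                                   # DB = 1, P = 2, B = 1

# DB = 1 is a but-for cause of P = 2: without the bite, P would be 1.
db_causes_p2 = solve(db=0)["P"] != actual["P"]
# P = 2 is a but-for cause of B = 1: setting P to 0 prevents the explosion.
p_causes_b1 = any(solve(db=1, p=p)["B"] != actual["B"] for p in (0, 1))
# DB = 1 is not a but-for cause of B = 1: the bomb goes off either way.
db_causes_b1 = solve(db=0)["B"] != actual["B"]

print(db_causes_p2, p_causes_b1, db_causes_b1)         # True True False
```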


In retrospect, the failure of right weakening is not so surprising. Taking true to be a tau- 
tology, if A is a cause of B, then we do not want to say that A is a cause of true, although B 
logically implies true. However, the failure of transitivity is quite surprising. Indeed, despite 
Examples 2.4.1 and 2.4.2, it seems natural to think of causality as transitive. People often 
think in terms of causal chains: A caused B, B caused C, C caused D, and therefore A 
caused D, where transitivity seems natural (although the law does not treat causality as tran- 
sitive in long causal chains; see Section 3.4.4). Not surprisingly, there are some definitions 
of causality that require causality to be transitive; see the notes at the end of the chapter for 
details. 

Why do we feel that causality is transitive? I believe that this is because, in typical settings, 
causality is indeed transitive. I give below two simple sets of conditions that are sufficient to 
guarantee transitivity. Because I expect that these conditions apply in many cases, it may
explain why we naturally generalize to thinking of causality as transitive and are surprised 
when it is not. 

For the purposes of this section, I restrict attention to but-for causes. This is what the law 
has focused on and seems to be the situation that arises most often in practice (although I will 
admit I have only anecdotal evidence to support this); certainly the law has done reasonably 
well by considering only but-for causality. Restricting to but-for causality has the further 
advantage that all the variants of the HP definition agree on what counts as a cause. (Thus, in 
this section, I do not specify which variant of the definition I am considering.) Restricting to 
but-for causality does not solve the transitivity problem. As Examples 2.4.1 and 2.4.2 already
show, even if $X_1 = x_1$ is a but-for cause of $X_2 = x_2$ and $X_2 = x_2$ is a but-for cause of
$X_3 = x_3$, it may not be the case that $X_1 = x_1$ is a cause of $X_3 = x_3$.

The first set of conditions assumes that $X_1$, $X_2$, and $X_3$ each has a default setting. Such
default settings make sense in many applications; “nothing happens” can often be taken as the
default. Suppose, for example, a billiards expert hits ball A, causing it to hit ball B, causing
it to carom into ball C, which then drops into the pocket. In this case, we can take the default
setting for the shot to be the expert doing nothing and the default setting for the balls to be
that they are not in motion. Let the default setting be denoted by the value 0.


Proposition 2.4.3 Suppose that
(a) $X_1 = x_1$ is a but-for cause of $X_2 = x_2$ in $(M, \vec{u})$,
(b) $X_2 = x_2$ is a but-for cause of $X_3 = x_3$ in $(M, \vec{u})$,
(c) $x_3 \neq 0$,
(d) $(M, \vec{u}) \models [X_1 \gets 0](X_2 = 0)$, and
(e) $(M, \vec{u}) \models [X_1 \gets 0, X_2 \gets 0](X_3 = 0)$.

Then $X_1 = x_1$ is a but-for cause of $X_3 = x_3$ in $(M, \vec{u})$.


Proof: If $X_2 = 0$ in the unique solution to the equations in the causal model $M_{X_1 \gets 0}$ in
context $\vec{u}$ and $X_3 = 0$ in the unique solution to the equations in $M_{X_1 \gets 0, X_2 \gets 0}$ in context $\vec{u}$,
then it is immediate that $X_3 = 0$ in the unique solution to the equations in $M_{X_1 \gets 0}$ in context
$\vec{u}$. That is, $(M, \vec{u}) \models [X_1 \gets 0](X_3 = 0)$. It follows from assumption (a) that $(M, \vec{u}) \models
X_1 = x_1$. We must thus have $x_1 \neq 0$; otherwise, $(M, \vec{u}) \models X_1 = 0 \land [X_1 \gets 0](X_3 = 0)$,
so $(M, \vec{u}) \models X_3 = 0$, which contradicts assumptions (b) and (c). Thus, $X_1 = x_1$ is a but-for
cause of $X_3 = x_3$, since the value of $X_3$ depends counterfactually on that of $X_1$. ∎


Although the conditions of Proposition 2.4.3 are clearly rather specialized, they arise often
in practice. Conditions (d) and (e) say that if $X_1$ remains in its default state, then so will
$X_2$, and if both $X_1$ and $X_2$ remain in their default states, then so will $X_3$. Put another way,
this says that the reason for $X_2$ not being in its default state is $X_1$ not being in its default
state, and the reason for $X_3$ not being in its default state is $X_1$ and $X_2$ both not being in their
default states. The billiard example can be viewed as a paradigmatic example of when these
conditions apply. It seems reasonable to assume that if the expert does not shoot, then ball A
does not move; and if the expert does not shoot and ball A does not move (in the context of
interest), then ball B does not move, and so on.

Of course, the conditions of Proposition 2.4.3 do not apply in either Example 2.4.1 or
Example 2.4.2. The obvious default values in Example 2.4.1 are MT = TT = 0, but
the equations say that in all contexts $\vec{u}$ of the causal model $M_B$ for this example, we have
$(M_B, \vec{u}) \models [MT \gets 0](TT = 1)$. In the second example, if we take DB = 0 and P = 0 to
be the default values of DB and P, then in all contexts $\vec{u}$ of the causal model $M_D$, we have
$(M_D, \vec{u}) \models [DB \gets 0](P = 1)$.

While Proposition 2.4.3 is useful, there are many examples where there is no obvious
default value. When considering the body’s homeostatic system, even if there is arguably a
default value for core body temperature, what is the default value for the external temperature?
But it turns out that the key ideas of the proof of Proposition 2.4.3 apply even if there is
no default value. Suppose that $X_1 = x_1$ is a but-for cause of $X_2 = x_2$ in $(M, \vec{u})$ and
$X_2 = x_2$ is a but-for cause of $X_3 = x_3$ in $(M, \vec{u})$. Then to get transitivity, it suffices
to find values $x_1'$, $x_2'$, and $x_3'$ such that $x_3 \neq x_3'$, $(M, \vec{u}) \models [X_1 \gets x_1'](X_2 = x_2')$, and
$(M, \vec{u}) \models [X_1 \gets x_1', X_2 \gets x_2'](X_3 = x_3')$. The argument in the proof of Proposition 2.4.3
(formalized in Lemma 2.10.2) shows that $(M, \vec{u}) \models [X_1 \gets x_1'](X_3 = x_3')$. It then follows
that $X_1 = x_1$ is a but-for cause of $X_3 = x_3$ in $(M, \vec{u})$. In Proposition 2.4.3, $x_1'$, $x_2'$, and
$x_3'$ were all 0, but there is nothing special about the fact that 0 is a default value here. As
long as we can find some values $x_1'$, $x_2'$, and $x_3'$, these conditions apply. I formalize this as
Proposition 2.4.4, which is a straightforward generalization of Proposition 2.4.3.


Proposition 2.4.4 Suppose that there exist values $x_1'$, $x_2'$, and $x_3'$ such that
(a) $X_1 = x_1$ is a but-for cause of $X_2 = x_2$ in $(M, \vec{u})$,
(b) $X_2 = x_2$ is a but-for cause of $X_3 = x_3$ in $(M, \vec{u})$,
(c) $x_3 \neq x_3'$,
(d) $(M, \vec{u}) \models [X_1 \gets x_1'](X_2 = x_2')$ (i.e., $(X_2)_{x_1'}(\vec{u}) = x_2'$), and
(e) $(M, \vec{u}) \models [X_1 \gets x_1', X_2 \gets x_2'](X_3 = x_3')$ (i.e., $(X_3)_{x_1' x_2'}(\vec{u}) = x_3'$).
Then $X_1 = x_1$ is a but-for cause of $X_3 = x_3$ in $(M, \vec{u})$.


To see how these ideas apply, suppose that a student receives an A+ in a course, which 
causes her to be accepted at Cornell University (her top choice, of course!), which in turn 
causes her to move to Ithaca. Further suppose that if she had received an A in the course, 
she would have gone to university $U_1$ and as a result moved to city $C_1$, and if she had gotten
anything else, she would have gone to university $U_2$ and moved to city $C_2$. This story can
be captured by a causal model with three variables: G for her grade, U for the university she
goes to, and C for the city she moves to. There are no obvious default values for any of these
three variables. Nevertheless, we have transitivity here: The student’s A+ was a cause of her
being accepted at Cornell, and being accepted at Cornell was a cause of her move to Ithaca;
it seems like a reasonable conclusion that the student’s A+ was a cause of her move to Ithaca.
Indeed, transitivity follows from Proposition 2.4.4. We can take the student getting an A to be
$x_1'$, the student being accepted at university $U_1$ to be $x_2'$, and the student moving to $C_1$ to be
$x_3'$ (assuming that $U_1$ is not Cornell and that $C_1$ is not Ithaca, of course).
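
The conditions of Proposition 2.4.4 can be checked mechanically for this story. In the following sketch, the dictionaries `UNIVERSITY` and `CITY` encode the structural equations as described above, and the labels "U1", "U2", "C1", and "C2" are placeholder strings assumed to differ from "Cornell" and "Ithaca"; all names are illustrative rather than taken from the text.

```python
GRADES = ("A+", "A", "other")
UNIVERSITY = {"A+": "Cornell", "A": "U1", "other": "U2"}   # equation for U given G
CITY = {"Cornell": "Ithaca", "U1": "C1", "U2": "C2"}       # equation for C given U

def solve(g, u=None):
    """Solve the model for grade g; U follows its equation unless an
    intervention sets it directly."""
    if u is None:
        u = UNIVERSITY[g]
    return {"G": g, "U": u, "C": CITY[u]}

actual = solve("A+")                       # G = A+, U = Cornell, C = Ithaca

# (a) G = A+ is a but-for cause of U = Cornell.
cond_a = any(solve(g)["U"] != actual["U"] for g in GRADES if g != "A+")
# (b) U = Cornell is a but-for cause of C = Ithaca.
cond_b = any(solve("A+", u)["C"] != actual["C"] for u in CITY if u != actual["U"])

# Witness values x1', x2', x3' as chosen in the text.
x1p, x2p, x3p = "A", "U1", "C1"
cond_c = x3p != actual["C"]                # (c) x3 differs from x3'
cond_d = solve(x1p)["U"] == x2p            # (d) [G <- A](U = U1)
cond_e = solve(x1p, u=x2p)["C"] == x3p     # (e) [G <- A, U <- U1](C = C1)

# Conclusion of the proposition: G = A+ is a but-for cause of C = Ithaca.
conclusion = any(solve(g)["C"] != actual["C"] for g in GRADES if g != "A+")

print(all([cond_a, cond_b, cond_c, cond_d, cond_e]), conclusion)   # True True
```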

The conditions provided in Proposition 2.4.4 are not only sufficient for causality to be 
transitive, they are necessary as well, as the following result shows. 


Proposition 2.4.5 If $X_1 = x_1$ is a but-for cause of $X_3 = x_3$ in $(M, \vec{u})$, then there exist
values $x_1'$, $x_2'$, and $x_3'$ such that $x_3 \neq x_3'$, $(M, \vec{u}) \models [X_1 \gets x_1'](X_2 = x_2')$, and
$(M, \vec{u}) \models [X_1 \gets x_1', X_2 \gets x_2'](X_3 = x_3')$.


Proof: Since $X_1 = x_1$ is a but-for cause of $X_3 = x_3$ in $(M, \vec{u})$, there must exist a
value $x_1' \neq x_1$ such that $(M, \vec{u}) \models [X_1 \gets x_1'](X_3 \neq x_3)$. Let $x_2'$ and $x_3' \neq x_3$
be such that $(M, \vec{u}) \models [X_1 \gets x_1'](X_2 = x_2' \land X_3 = x_3')$. It easily follows that
$(M, \vec{u}) \models [X_1 \gets x_1', X_2 \gets x_2'](X_3 = x_3')$. ∎


In light of Propositions 2.4.4 and 2.4.5, understanding why causality is so often taken 
to be transitive comes down to finding sufficient conditions to guarantee the assumptions of 
Proposition 2.4.4. I now present a set of conditions sufficient to guarantee the assumptions 
of Proposition 2.4.4, motivated by the two examples showing that causality is not transitive. 
To deal with the problem in Example 2.4.2, I require that for every value $x_2'$ in the range of
$X_2$, there is a value $x_1'$ in the range of $X_1$ such that $(M, \vec{u}) \models [X_1 \gets x_1'](X_2 = x_2')$. This
requirement holds in many cases of interest; it is guaranteed to hold if $X_1 = x_1$ is a but-for
cause of $X_2 = x_2$ and $X_2$ is binary (since but-for causality requires that two different values
of $X_1$ result in different values of $X_2$). But this requirement does not hold in Example 2.4.2;
no setting of DB can force P to be 0.

Imposing this requirement still does not deal with the problem in Example 2.4.1. To do 
that, we need one more condition: $X_2$ must lie on every causal path from $X_1$ to $X_3$. Roughly
speaking, this says that all the influence of $X_1$ on $X_3$ goes through $X_2$. This condition does
not hold in Example 2.4.1: as Figure 2.8 shows, there is a direct causal path from MT to
BMC that does not include TT. However, this condition does hold in many cases of interest.
Going back to the example of the student’s grade, the only way that the student’s grade can 
influence which city the student moves to is via the university that accepts the student. 

To make this precise, I first need to define causal path. A causal path in a causal setting
$(M, \vec{u})$ is a sequence $(Y_1, \ldots, Y_k)$ of variables such that $Y_{j+1}$ depends on $Y_j$ in context $\vec{u}$ for
$j = 1, \ldots, k-1$. Since there is an edge between $Y_j$ and $Y_{j+1}$ in the causal network for M
(assuming that we fix context $\vec{u}$) exactly if $Y_{j+1}$ depends on $Y_j$, a causal path is just a path in
the causal network. A causal path in $(M, \vec{u})$ from $X_1$ to $X_2$ is just a causal path whose first
node is $X_1$ and whose last node is $X_2$. Finally, Y lies on a causal path in $(M, \vec{u})$ from $X_1$ to
$X_2$ if Y is a node (possibly $X_1$ or $X_2$) on a causal path in $(M, \vec{u})$ from $X_1$ to $X_2$.

The following result summarizes the second set of conditions sufficient for transitivity. 
(Recall that $\mathcal{R}(X)$ denotes the range of the variable X.)


Proposition 2.4.6 Suppose that $X_1 = x_1$ is a but-for cause of $X_2 = x_2$ in the causal setting
$(M, \vec{u})$, $X_2 = x_2$ is a but-for cause of $X_3 = x_3$ in $(M, \vec{u})$, and the following two conditions
hold:


(a) for every value $x_2' \in \mathcal{R}(X_2)$, there exists a value $x_1' \in \mathcal{R}(X_1)$ such that
$(M, \vec{u}) \models [X_1 \gets x_1'](X_2 = x_2')$ (i.e., $(X_2)_{x_1'}(\vec{u}) = x_2'$);
(b) $X_2$ is on every causal path in $(M, \vec{u})$ from $X_1$ to $X_3$.
Then $X_1 = x_1$ is a but-for cause of $X_3 = x_3$.


The proof of Proposition 2.4.6 is not hard, although we must be careful to get all the details 
right. The high-level idea of the proof is easy to explain. Suppose that $X_2 = x_2$ is a but-for
cause of $X_3 = x_3$ in $(M, \vec{u})$. Then there must be some values $x_2 \neq x_2'$ and $x_3 \neq x_3'$ such
that $(M, \vec{u}) \models [X_2 \gets x_2'](X_3 = x_3')$. By assumption, there exists a value $x_1' \in \mathcal{R}(X_1)$ such
that $(M, \vec{u}) \models [X_1 \gets x_1'](X_2 = x_2')$. The requirement that $X_2$ is on every causal path from
$X_1$ to $X_3$ guarantees that $[X_2 \gets x_2'](X_3 = x_3')$ implies $[X_1 \gets x_1', X_2 \gets x_2'](X_3 = x_3')$ in
$(M, \vec{u})$ (i.e., $(X_3)_{x_2'}(\vec{u}) = x_3'$ implies $(X_3)_{x_1' x_2'}(\vec{u}) = x_3'$). Roughly speaking, $X_2$ “screens
off” the effect of $X_1$ on $X_3$, since it is on every causal path from $X_1$ to $X_3$. Now we can
apply Proposition 2.4.4. I defer the formal argument to Section 2.10.2.
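
Both conditions of Proposition 2.4.6 are easy to test on small models. The sketch below does so for the three running examples; the edge lists are read off the descriptions of the causal networks, the equation TT = 1 - MT for the doctors example is an assumption, and the helper names (`all_paths`, `on_every_path`, `covers_range`) are hypothetical.

```python
def all_paths(edges, src, dst, prefix=()):
    """All directed paths from src to dst in an acyclic causal network."""
    prefix = prefix + (src,)
    if src == dst:
        return [prefix]
    paths = []
    for child in edges.get(src, ()):
        paths.extend(all_paths(edges, child, dst, prefix))
    return paths

def on_every_path(edges, x1, x2, x3):
    """Condition (b): X2 lies on every causal path from X1 to X3."""
    return all(x2 in path for path in all_paths(edges, x1, x3))

def covers_range(x1_range, x2_range, x2_given_x1):
    """Condition (a): every value of X2 results from some setting of X1."""
    return {x2_given_x1(x1) for x1 in x1_range} >= set(x2_range)

examples = {
    "doctors": dict(edges={"MT": ["TT", "BMC"], "TT": ["BMC"]},
                    chain=("MT", "TT", "BMC"),
                    x1_range=(0, 1), x2_range=(0, 1),
                    eq=lambda mt: 1 - mt),            # assumed equation for TT
    "dog bite": dict(edges={"DB": ["P"], "P": ["B"]},
                     chain=("DB", "P", "B"),
                     x1_range=(0, 1), x2_range=(0, 1, 2),
                     eq=lambda db: db + 1),
    "student": dict(edges={"G": ["U"], "U": ["C"]},
                    chain=("G", "U", "C"),
                    x1_range=("A+", "A", "other"),
                    x2_range=("Cornell", "U1", "U2"),
                    eq={"A+": "Cornell", "A": "U1", "other": "U2"}.get),
}

results = {}
for name, ex in examples.items():
    results[name] = (covers_range(ex["x1_range"], ex["x2_range"], ex["eq"]),
                     on_every_path(ex["edges"], *ex["chain"]))
    print(name, results[name])
```

Under these assumptions, the doctors example satisfies (a) but fails (b), the dog-bite example satisfies (b) but fails (a), and the student example satisfies both.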

It is easy to construct examples showing that the conditions of Proposition 2.4.6 are not 
necessary for causality to be transitive. Suppose that $X_1 = x_1$ causes $X_2 = x_2$, $X_2 = x_2$
causes $X_3 = x_3$, and there are several causal paths from $X_1$ to $X_3$. Roughly speaking, the
reason that $X_1 = x_1$ may not be a but-for cause of $X_3 = x_3$ is that the effects of $X_1$ on $X_3$
may “cancel out” along the various causal paths. This is what happens in the homeostasis
example. If $X_2$ is on all the causal paths from $X_1$ to $X_3$, then as we have seen, all the effect
of $X_1$ on $X_3$ is mediated by $X_2$, so the effect of $X_1$ on $X_3$ on different causal paths cannot
“cancel out”. But even if $X_2$ is not on all the causal paths from $X_1$ to $X_3$, the effects of $X_1$ on
$X_3$ may not cancel out along the causal paths, and $X_1 = x_1$ may still be a cause of $X_3 = x_3$.
That said, it seems difficult to find a weakening of the condition in Proposition 2.4.6 that is 
simple to state and suffices for causality to be transitive. 


2.5 Probability and Causality 


In general, an agent trying to determine whether A is a cause of B may not know the exact 
model or context. An agent may be uncertain about whether Suzy and Billy both threw, or 
just Suzy threw; he may be uncertain about who will hit first if both Suzy and Billy throw 
(or perhaps they hit simultaneously). In a perhaps more interesting setting, at some point, 
people were uncertain about whether smoking caused cancer or whether smoking and cancer 
were both the outcome of the same genetic problem, which would lead to smoking and cancer 
being correlated, but no causal relationship between the two. To deal with such uncertainty, 
an agent can put a probability on causal settings. For each causal setting $(M, \vec{u})$, the agent
can determine whether A is a cause of B in $(M, \vec{u})$, and so compute the probability that A
is a cause of B. Having probabilities on causal settings will be relevant in the discussion of 
blame in Chapter 6. Here I focus on what seems to be a different source of uncertainty: un- 
certainty in the equations. In causal models, all the equations are assumed to be deterministic. 
However, as was observed, in many cases, it seems more reasonable to think of outcomes 
as probabilistic. So rather than thinking of Suzy’s rock as definitely hitting the bottle if she 
throws, we can think of her as hitting with probability .9. That is, rather than taking Suzy to 
always be accurate, we can take her to be accurate with some probability. 

The first step in considering how to define causality in the presence of uncertain outcomes 
is to show how this uncertainty can be captured in the formal framework. Earlier, I assumed
that, for each endogenous variable X, there was a deterministic function $F_X$ that described
the value of X as a function of all the other variables. Now, rather than assuming that $F_X$
returns a particular value, I assume that it returns a distribution over the values of X given the
values of all other variables. This is perhaps easiest to understand by means of an example.
In the rock-throwing example, rather than assuming that if Suzy throws a rock, she definitely
hits the bottle, suppose we assume that she only hits it with probability .9 and misses
with probability .1. Suppose that if Billy throws and Suzy hasn’t hit the bottle, then Billy will 
hit it with probability .8 and miss with probability .2. Finally, for simplicity, assume that, if 
hit by either Billy or Suzy, the bottle will definitely shatter. 


The next two paragraphs show how the probabilistic version of this story can be modeled 
formally in the structural-equations framework and can be skipped on a first reading. Recall 
that for the more sophisticated model $M'_{RT}$ of the (deterministic) rock-throwing story, although
I earlier wrote the equation for SH as SH = ST, this is really an abbreviation for the
function $F_{SH}(i_1, i_2, i_3, i_4, i_5) = i_2$: $F_{SH}$ is a function of the values of the exogenous variable
U (which is not described in the causal network but determines the values of BT and ST) and
the endogenous variables ST, BT, BH, and BS. These values are given by $i_1, \ldots, i_5$. The
equation SH = ST says that the value of SH depends only on the value of ST and is equal to
it; that explains the output $i_2$. For the probabilistic version of the story, $F_{SH}(i_1, 1, i_3, i_4, i_5)$
is the probability distribution that places probability .9 on the value 1 and probability .1 on
the value 0. In words, this says that if Suzy throws (ST = 1), then no matter what the values
$i_1$, $i_3$, $i_4$, and $i_5$ of U, BT, BH, and BS are, respectively, $F_{SH}(i_1, 1, i_3, i_4, i_5)$ is the probability
distribution that puts probability .9 on the event SH = 1 and probability .1 on the event
SH = 0. This captures the fact that if Suzy throws, then she has a probability .9 of hitting
the bottle (no matter what Billy does). Similarly, $F_{SH}(i_1, 0, i_3, i_4, i_5)$ is the distribution that
places probability 1 on the value 0: if Suzy doesn’t throw, then she certainly won’t hit the
bottle, so the event SH = 0 has probability 1.


Similarly, in the deterministic model of the rock-throwing example, the function $F_{BH}$ is
such that $F_{BH}(i_1, i_2, i_3, i_4, i_5) = 1$ if $(i_3, i_4) = (1, 0)$, and is 0 otherwise. That is, taking
$i_1, \ldots, i_5$ to be the values of U, ST, BT, SH, and BS, respectively, Billy’s throw hits the
bottle only if he throws ($i_3 = 1$) and Suzy doesn’t hit the bottle ($i_4 = 0$). For the probabilistic
version of the story, $F_{BH}(i_1, i_2, 1, 0, i_5)$ is the distribution that puts probability .8 on 1 and
probability .2 on 0; for all other values of $i_3$ and $i_4$, $F_{BH}(i_1, i_2, i_3, i_4, i_5)$ puts probability
1 on 0. (Henceforth, I just describe the probabilities in words rather than using the formal
notation.)
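
One way to make the probabilistic equations concrete is to write each one as a function that returns a distribution, represented here as a dictionary from values to probabilities. This is a sketch only: the Python functions `F_SH` and `F_BH` mirror the notation above, but the dictionary representation and the use of `None` for arguments that an equation ignores are choices made for illustration.

```python
from fractions import Fraction

def F_SH(u, st, bt, bh, bs):
    """Distribution over SH given all other variables; only ST matters."""
    if st == 1:
        return {1: Fraction(9, 10), 0: Fraction(1, 10)}
    return {0: Fraction(1)}

def F_BH(u, st, bt, sh, bs):
    """Distribution over BH given all other variables; only BT and SH matter."""
    if (bt, sh) == (1, 0):
        return {1: Fraction(8, 10), 0: Fraction(2, 10)}
    return {0: Fraction(1)}

# Probability that some rock hits the bottle when both throw (ST = BT = 1).
# Arguments that the equations ignore are passed as None.
p_hit = Fraction(0)
for sh, p_sh in F_SH(None, 1, 1, None, None).items():
    for bh, p_bh in F_BH(None, 1, 1, sh, None).items():
        if sh == 1 or bh == 1:
            p_hit += p_sh * p_bh

print(p_hit)   # 49/50
```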


The interpretation of these probabilities is similar to the interpretation of the deterministic 
equations we have used up to now. For example, the fact that Suzy’s rock hits with probability 
.9 does not mean that the probability of Suzy’s rock hitting conditional on her throwing is .9; 
rather, it means that if there is an intervention that results in Suzy throwing, the probability 
of her hitting is .9. The probability of rain conditional on a low barometer reading is high. 
However, intervening on the barometer reading, say, by setting the needle to point to a low 
reading, does not affect the probability of rain. 
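
The difference between conditioning and intervening can be seen in a toy version of the barometer example. Everything numerical in the following sketch is an assumption made for illustration (the probabilities `P_LOW` and `P_WET`, and the equations themselves): low pressure makes the barometer read low and, together with a second exogenous factor, brings rain.

```python
from fractions import Fraction
from itertools import product

P_LOW = Fraction(3, 10)   # assumed probability that the pressure is low
P_WET = Fraction(8, 10)   # assumed probability that low pressure brings rain

def solve(low, wet, set_barometer=None):
    """The barometer reads low iff the pressure is low, unless an
    intervention sets the needle; rain depends only on the pressure."""
    barometer_low = low if set_barometer is None else set_barometer
    rain = low and wet
    return barometer_low, rain

def prob(event, set_barometer=None):
    """Probability of `event` (a predicate on the solution), summing over
    the four settings of the exogenous variables."""
    total = Fraction(0)
    for low, wet in product((True, False), repeat=2):
        weight = (P_LOW if low else 1 - P_LOW) * (P_WET if wet else 1 - P_WET)
        if event(*solve(low, wet, set_barometer)):
            total += weight
    return total

p_rain = prob(lambda b, r: r)
p_rain_given_low_reading = prob(lambda b, r: b and r) / prob(lambda b, r: b)
p_rain_after_forcing_low = prob(lambda b, r: r, set_barometer=True)

print(p_rain, p_rain_given_low_reading, p_rain_after_forcing_low)
# 6/25 4/5 6/25
```

In this toy model, observing a low reading raises the probability of rain from 6/25 to 4/5, while forcing the needle to a low reading leaves it at 6/25.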


There have been many attempts to give a definition of causality that takes probability into 
account. They typically take A to be a cause of B if A raises the probability of B. I am going 
to take a different approach here. I take a standard technique in computer science to “pull
out the probability”, allowing me to convert a single causal setting where the equations are
probabilistic to a probability over causal settings, where in each causal setting, the equations
are deterministic. This, in turn, will allow me to avoid giving a separate definition of probabilistic
causality. Rather, I will be able to use the definition of causality already given for
deterministic models and talk about the probability of causality, that is, the probability that
A is a cause of B. As we shall see, this approach seems to deal naturally with a number of
problems regarding probabilistic causality that have been raised in the literature. 

The assumption is that the equations would be deterministic if we knew all the relevant 
details. This assumption is probably false at the quantum level; most physicists believe that 
at this level, the world is genuinely probabilistic. But for the macroscopic events that I focus 
on in this book (and that most people are interested in when applying causality judgments), 
it seems reasonable to view the world as fundamentally deterministic. With this viewpoint, if 
Suzy hits the bottle with probability .9, then there must be (perhaps poorly understood) reasons
that cause her to miss: an unexpected gust of wind or a momentary distraction throwing off 
her aim. We can “package up” all these factors and make them exogenous. That is, we can 
have an exogenous variable U’ with values either 0 or 1 depending on whether the features 
that cause Suzy to miss are present, where U’ = 0 with probability .1 and U’ = 1 with 
probability .9. By introducing such a variable U’ (and a corresponding variable for Billy), 
we have essentially pulled the probability out of the equations and put it into the exogenous 
variable. 

I now show in more detail how this approach plays out in the context of the rock-throwing 
example with the probabilities given above. Consider the causal setting where Suzy and Billy 
both throw. We may be interested in knowing the probability that Suzy or Billy will be a cause 
of the bottle shattering if we don’t know whether the bottle will in fact shatter, or in knowing 
the probability that Suzy or Billy is the cause if we know that the bottle actually did shatter. 
(Notice that in the former case, we are asking about an effect of a (possible) cause, whereas in 
the latter case, we are asking about the cause of an effect.) When we pull out the probability, 
there are four causal models that arise here, depending on whether Suzy’s rock hits the bottle 
and whether Billy’s rock hits the bottle if Suzy’s rock does not hit. Specifically, consider the 
following four models, in a context where both Suzy and Billy throw rocks: 


• $M_1$: Suzy’s rock hits and Billy’s rock would hit if Suzy missed;
• $M_2$: Suzy’s rock hits but Billy’s rock would not hit if Suzy missed;
• $M_3$: Suzy’s rock misses and Billy’s rock hits;
• $M_4$: neither rock hits.


Note that these models differ in their structural equations, although they involve the same 
exogenous and endogenous variables. (M1 is just the model considered earlier, described in
Figure 2.3.) Suzy’s throw is a cause in M1 and M2; Billy’s throw is a cause in M3.

If we know that the bottle shattered, Suzy’s rock hit, and Billy’s didn’t, then the probability 
that Suzy is the cause is 1. We are conditioning on the model being either M1 or M2 because
these are the models where Suzy’s rock hits, and Suzy is the cause of the bottle shattering in
both. But suppose that all we know is that the bottle shattered, and we are trying to determine
the probability that Suzy is the cause. Now we can condition on the model being either M1,
M2, or M3, so if the prior probability of M4 is p, then the probability that Suzy is a cause of
the bottle shattering is .9/(1 − p): Suzy is a cause of the bottle shattering as long as she hits
the bottle and it shatters (i.e., in models M1 and M2). Unfortunately, the story does not give
us the probability of M4. It is .1 − q, where q is the prior probability of M3, but the story
does not tell us that either. We can say more if we make a reasonable additional assumption:
that whether Suzy’s rock hits is independent of whether Billy’s rock would hit if Suzy missed.
Then M1 has probability .72 (= .9 × .8), M2 has probability .18, M3 has probability .08, and
M4 has probability .02. With this assumption and no further information, the probability of
Suzy being a cause of the bottle shattering is .9/.98 = 45/49.
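
The arithmetic in this paragraph can be checked mechanically. The following is a minimal sketch, assuming (as the story does) that Suzy’s rock hits with probability .9, that Billy’s rock would hit with probability .8 if Suzy’s missed, and that the two outcomes are independent. The dictionary keys and variable names are illustrative choices, not notation from the formal framework.

```python
from fractions import Fraction

p_suzy_hits = Fraction(9, 10)    # Suzy's rock hits
p_billy_would = Fraction(8, 10)  # Billy's rock would hit if Suzy's missed

# Prior probability of each deterministic model under independence.
models = {
    "M1": p_suzy_hits * p_billy_would,              # Suzy hits, Billy would hit
    "M2": p_suzy_hits * (1 - p_billy_would),        # Suzy hits, Billy would not
    "M3": (1 - p_suzy_hits) * p_billy_would,        # Suzy misses, Billy hits
    "M4": (1 - p_suzy_hits) * (1 - p_billy_would),  # neither rock hits
}

shatters = {"M1", "M2", "M3"}  # models in which the bottle shatters
suzy_is_cause = {"M1", "M2"}   # models in which Suzy's throw is a cause

p_shatter = sum(models[m] for m in shatters)
p_suzy_cause = sum(models[m] for m in suzy_is_cause) / p_shatter

print(p_suzy_cause)  # 45/49
```

Exact fractions are used rather than floating-point numbers so that the result can be compared directly with the value in the text.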

Although it seems reasonable to view the outcomes “Suzy’s rock hits the bottle” and 
“Billy’s rock would have hit the bottle if Suzy’s hadn’t” as independent, they certainly do 
not have to be. We could, for example, assume that Billy and Suzy are perfectly accurate 
when there are no wind gusts and miss when there are wind gusts, and that wind gusts occur
with probability .3. Then it would be the case that Suzy’s rock hits with probability .7 and
that Billy’s rock would have hit with probability .7 if Suzy’s had missed, but there would have 
been only two causal models: the one that we considered in the deterministic case, which 
would be the actual model with probability .7, and the one where both miss. (A better model 
of this situation might have an exogenous variable U’ such that U’ = 1 if there are wind gusts 
and U’ = 0 if there aren’t, and then have deterministic equations for SH and BH.) The key 
message here is that, in general, further information might be needed to determine the prob- 
ability of causality beyond what is provided by probabilistic structural equations as defined 
above. 
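
To see how much the answer can vary when independence is not assumed, consider again the version of the story where Suzy’s rock hits with probability .9 and Billy’s rock would hit with probability .8 if Suzy’s missed. The sketch below treats q, the prior probability of the model where Suzy misses and Billy hits, as a free parameter; every q between 0 and .1 is compatible with those two numbers. The function name and the sample values of q are illustrative assumptions.

```python
from fractions import Fraction

P_SUZY_HITS = Fraction(9, 10)

def prob_suzy_cause_given_shatter(q):
    """P(Suzy is a cause | bottle shattered) when P(M3) = q.

    M3 is the model where Suzy misses and Billy hits, so 0 <= q <= 1/10.
    The bottle shatters in every model except M4.
    """
    p_miss = 1 - P_SUZY_HITS
    if not 0 <= q <= p_miss:
        raise ValueError("q must lie between 0 and P(Suzy misses)")
    p_m4 = p_miss - q  # the probability called p in the text
    return P_SUZY_HITS / (1 - p_m4)

# Perfect anticorrelation, independence, and "Billy always backs Suzy up".
for q in (Fraction(0), Fraction(8, 100), Fraction(1, 10)):
    print(q, prob_suzy_cause_given_shatter(q))
```

Under these assumptions the probability that Suzy is a cause, given that the bottle shattered, ranges from .9 (when q = .1) to 1 (when q = 0); the independence assumption picks out the intermediate value 45/49.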

A few more examples should help illustrate how this approach works and how much we 
can say. 


Example 2.5.1 Suppose that a doctor treats Billy on Monday. The treatment will result in 
Billy recovering on Tuesday with probability .9. But Billy might also recover with no treat- 
ment due to some other factors, with independent probability .1. By “independent probability” 
here, I mean that conditional on Billy being treated on Monday, the events “Billy recovers due 
to the treatment” and “Billy recovers due to other factors” are independent; thus, the probabil- 
ity that Billy will not recover on Tuesday given that he is treated on Monday is (.1)(.9). Now, 
in fact, Billy is treated on Monday and he recovers on Tuesday. Is the treatment the cause of 
Billy getting better? 

The standard answer is yes, because the treatment greatly increases the likelihood of Billy 
getting better. But there is still the nagging concern that Billy’s recovery was not due to the 
treatment. He might have gotten better if he had just been left alone. 

Formally, we have a causal model with three endogenous binary variables: MT (the doctor 
treats Billy on Monday), OF (the other factors that would make Billy get better occur), and 
BMC (Billy’s medical condition, which we can take to be 1 if he gets better and 0 otherwise).
The exogenous variable(s) determine MT and OF. The probability that MT = 1 does not
matter for this analysis; for definiteness, assume that MT = 1 with probability .8. According
to the story, OF = 1 with probability .1. Finally, BMC = 1 if OF = 1, BMC = 0 if
MT = OF = 0, and BMC = 1 with probability .9 if MT = 1 and OF = 0. Again, after
we pull out the probability, there are eight models, which differ in whether MT is set to 0 or
1, whether OF is set to 0 or 1, and whether BMC = 1 when MT = 1 and OF = 0. Treating
these choices as independent (as is suggested by the story), it is straightforward to calculate
the probability of each of the eight models. Since we know that Billy actually did get better
and was treated by his doctor, we ignore the four models where MT = 0 and the model where
MT = 1, OF = 0, and BMC = 0 if MT = 1 and OF = 0. The remaining three models
have total probability .8(1 − (.9)(.1)). MT = 1 is a cause of BMC = 1 in two of the three models;
it is not a cause only in the model where OF = 1, MT = 1, but BMC would be 0 if OF
were set to 0. This latter model has prior probability .8(.1)(.1). Straightforward calculations
now show that the treatment is a cause of Billy’s recovery with probability
$1 - \frac{.8(.1)(.1)}{.8(1 - (.9)(.1))} = 90/91$. Similarly, the probability that Billy’s recovery was
caused by other factors is $\frac{.8(.1)}{.8(1 - (.9)(.1))} = 10/91$. These probabilities sum to more
than 1 because with probability $\frac{.8(.1)(.9)}{.8(1 - (.9)(.1))} = 9/91$,
both are causes; Billy’s recovery is overdetermined.
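
The three probabilities can be recovered by enumerating the eight models directly. The sketch below assumes the numbers used in the story (MT = 1 with probability .8, OF = 1 with probability .1, and the treatment working with probability .9) and treats the three choices as independent. It does not implement the HP definition; it simply encodes the verdicts described above, namely that the treatment is a cause exactly when it would have worked, and that other factors are a cause exactly when they occur. The names `weight`, `eff`, and the accumulator variables are illustrative.

```python
from fractions import Fraction
from itertools import product

P_MT = Fraction(8, 10)   # doctor treats Billy (value assumed for definiteness)
P_OF = Fraction(1, 10)   # other factors occur
P_EFF = Fraction(9, 10)  # BMC = 1 when MT = 1 and OF = 0

def weight(value, p_one):
    """Probability that a binary choice takes `value` when P(1) = p_one."""
    return p_one if value == 1 else 1 - p_one

total = treat = other = both = Fraction(0)
for mt, of, eff in product((0, 1), repeat=3):  # the eight models
    prior = weight(mt, P_MT) * weight(of, P_OF) * weight(eff, P_EFF)
    bmc = 1 if of == 1 or (mt == 1 and eff == 1) else 0
    if mt != 1 or bmc != 1:
        continue  # inconsistent with Billy being treated and recovering
    total += prior
    treat += prior if eff == 1 else 0                # treatment is a cause
    other += prior if of == 1 else 0                 # other factors are a cause
    both += prior if (eff == 1 and of == 1) else 0   # overdetermination

print(treat / total, other / total, both / total)  # 90/91 10/91 9/91
```

The prior probability of MT = 1 cancels out of every ratio, which is why its value does not matter for the analysis.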

Now suppose that we can perform a (somewhat expensive) test to determine whether the 
doctor’s treatment actually caused Billy to get better. Again, the doctor treats Billy and Billy 
recovers, but the test is carried out, and it is shown that the treatment was not the cause. It is 
still the case that the treatment significantly raises the probability of Billy recovering, but now 
we certainly don’t want to call the treatment the cause. And, indeed, it is not. Conditioning 
on the information, we can discard three of the four models. The only one left is the one in 
which Billy’s recovery is due to other factors (with probability 1). Moreover, this conclusion 
does not depend on the independence assumption. ■


Example 2.5.2 Now suppose that there is another doctor who can treat Billy on Monday, 
with different medication. Both treatments are individually effective with probability .9. Un- 
fortunately, if Billy gets both treatments and both are effective, then, with probability .8, there 
will be a bad interaction, and Billy will die. For this version of the story, suppose that there are 
no other factors involved; if Billy gets neither treatment, he will not recover. Further suppose 
that we can perform a test to see whether a treatment was effective. Suppose that in fact both 
doctors treat Billy and, despite that, Billy recovers on Tuesday. Without further information, 
there are three relevant models: in the first, only the first doctor’s medication was effective; in 
the second, only the second doctor’s medication was effective; in the third, both are effective. 
If further testing shows that both treatments were effective, then there is no uncertainty; we 
are down to just one model: the third one. According to the original and updated HP def- 
inition, both treatments are the cause of Billy’s recovery, with probability 1. (According to 
the modified HP definition, they are both parts of the cause.) This is true despite the fact that, 
given that the first doctor treated Billy, the second doctor treating Billy significantly lowers the
probability of Billy recovering (and symmetrically with the roles of the doctors reversed). ■


Example 2.5.3 Now suppose that a doctor has treated 1,000 patients. They each would have 
had probability .9 of recovering even without the treatment. With the treatment, they recover 
with independent probability .1. In fact, 908 patients recovered, and Billy is one of them. What
is the probability that the treatment was the cause of Billy’s recovery? Calculations similar 
to those in Example 2.5.1 show that it is 10/91. Standard probabilistic calculations show that 
there is quite a high probability that the treatment is a cause of at least one patient’s recovery. 
Indeed, there is quite a high probability that the treatment is the unique cause of at least one 
patient’s recovery. But we do not know which patient that is. ■
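
The "standard probabilistic calculations" can be sketched as follows. This is only an illustration under an added assumption that the story does not state explicitly: that the patients are independent of one another. As in Example 2.5.1, the treatment is counted as a cause for a recovered patient when it worked, and as the unique cause when it worked and the other factors were absent. All names are illustrative.

```python
from fractions import Fraction

P_TREATMENT = Fraction(1, 10)  # treatment brings about recovery
P_OTHER = Fraction(9, 10)      # recovery would happen without treatment
RECOVERED = 908

p_recover = 1 - (1 - P_TREATMENT) * (1 - P_OTHER)   # 91/100
p_cause = P_TREATMENT / p_recover                   # per recovered patient
p_unique = P_TREATMENT * (1 - P_OTHER) / p_recover  # treatment worked, nothing else did

# Probability that the property holds for at least one recovered patient,
# assuming independence across patients.
p_some_cause = 1 - float(1 - p_cause) ** RECOVERED
p_some_unique = 1 - float(1 - p_unique) ** RECOVERED

print(p_cause, p_unique)  # 10/91 1/91
print(p_some_cause, p_some_unique)
```

Both of the last two probabilities come out very close to 1, even though for any individual patient such as Billy the probability that the treatment was a cause is only 10/91.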

Example 2.5.4 Again, Billy and Suzy throw rocks at a bottle. If either Billy or Suzy throws a
rock, it will hit the bottle. But now the bottle is heavy and will, in general, not topple over if it 
is hit. If only Billy hits the bottle, it will topple over with probability .2; similarly, if only Suzy 
hits the bottle, it will topple over with probability .2. If they hit the bottle simultaneously, it 
will topple with probability .7. In fact, both Billy and Suzy throw rocks at the bottle, and it 
topples over. We are interested in the extent to which Suzy and Billy are the causes of the 
bottle toppling over. 

For simplicity, I restrict attention to contexts where both Billy and Suzy throw, both hit, 
and the bottle topples over. After pulling out the probability, there are four causal models of 
interest, depending on what would have happened if just Suzy had hit the bottle and if just 
Billy had hit the bottle: 


■ M1: the bottle would have toppled over if either only Suzy’s rock had hit it or if only
Billy’s rock had hit it;

■ M2: the bottle would have toppled over if only Suzy’s rock had hit it but not if only
Billy’s rock had hit it;

■ M3: the bottle would have toppled over if only Billy’s rock had hit it but not if only
Suzy’s rock had hit it;

■ M4: the bottle would not have toppled over if either only Billy’s rock or only Suzy’s
rock had hit it.


I further simplify by treating the outcomes “the bottle would have toppled over had only 
Suzy’s rock hit it” and “the bottle would have toppled over had only Billy’s rock hit it” as 
independent (even though neither may be independent of the outcome “the bottle would have 
toppled over had both Billy and Suzy’s rocks hit it”). With this assumption, the probability 
of M1 (conditional on the bottle toppling over if both Suzy’s and Billy’s rocks hit it) is .04,
the conditional probability of M2 is .16, the conditional probability of M3 is .16, and the 
conditional probability of M4 is .64. 

It is easy to check that Suzy’s throw is a cause of the bottle toppling in models M1, M2, and
M4, but not in M3. Similarly, Billy’s throw is a cause in models M1, M3, and M4. They are
both causes in models M1 and M4 according to the original and updated HP definition, as well
as in M4 according to the modified HP definition; in M1, the conjunction ST = 1 ∧ BT = 1
is a cause according to the modified HP definition. Thus, the probability that Suzy’s throw is
part of a cause of the bottle shattering is .84, and likewise for Billy. The probability that both
are parts of causes is .68 (since this is the case in models M1 and M4). ■
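
The numbers in this example can be reproduced with a short computation. The sketch below assumes, as in the text, that "would topple if only Suzy’s rock hit" and "would topple if only Billy’s rock hit" are independent events of probability .2 each, and that we condition on the bottle toppling when both rocks hit. The causal verdicts (Suzy’s throw is part of a cause in every model except the one where only Billy’s rock alone would topple the bottle, and symmetrically for Billy) are taken from the discussion above rather than derived from the HP definition. The model labels and helper names are illustrative.

```python
from fractions import Fraction

P_ALONE = Fraction(2, 10)  # bottle topples if only one given rock hits

# (topples if only Suzy's rock hits, topples if only Billy's rock hits)
models = {
    "M1": (True, True),
    "M2": (True, False),
    "M3": (False, True),
    "M4": (False, False),
}

def prior(suzy_alone, billy_alone):
    """Probability of a model under the independence assumption."""
    ps = P_ALONE if suzy_alone else 1 - P_ALONE
    pb = P_ALONE if billy_alone else 1 - P_ALONE
    return ps * pb

prob = {name: prior(*flags) for name, flags in models.items()}

# Verdicts stated in the text, not computed here.
suzy_part_of_cause = {"M1", "M2", "M4"}
billy_part_of_cause = {"M1", "M3", "M4"}

p_suzy = sum(prob[m] for m in suzy_part_of_cause)
p_billy = sum(prob[m] for m in billy_part_of_cause)
p_both = sum(prob[m] for m in suzy_part_of_cause & billy_part_of_cause)

print(p_suzy, p_billy, p_both)  # 21/25 21/25 17/25
```

Here 21/25 = .84 and 17/25 = .68, matching the values above.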


As these examples show, the HP definition lets us make perfectly sensible statements about 
the probability of causality. There is nothing special about these examples; all other examples 
can be handled equally well. Perhaps the key message here is that there is no need to work 
hard on getting a definition of probabilistic causality, at least at the macroscopic level; it 
suffices to get a good definition of deterministic causality. But what about if we want to treat 
events at the quantum level, which seems inherently probabilistic? A case can still be made 
that it is psychologically useful to think deterministically. Whether the HP definitions can be 
extended successfully to an inherently probabilistic framework still remains open.

Even if we ignore issues at the microscopic level, there is still an important question to be
addressed: In what sense is the single model with probabilistic equations equivalent to the 
set of deterministic causal models with a probability on them? In a naive sense, the answer 
is no. The examples above show that the probabilistic equations do not in general suffice to 
determine the probability of the deterministic causal models. Thus, there is a sense in which 
deterministic causal models with a probability on them carry more information than a single 
probabilistic model. Moreover, this extra information is useful. As we have seen, it may be 
necessary to determine the answer to questions such as “How probable is it that Billy is the 
cause of the bottle shattering?” when all we see is a shattered bottle and know that Billy and 
Suzy both threw. Such a determination can be quite important in, for example, legal cases. 
Although bottle shattering may not be of great legal interest, we may well want to know the 
probability of Bob being the cause of Charlie dying in a case where there are multiple potential 
causes. 


If we make reasonable independence assumptions, then a probabilistic model determines a 
probability on deterministic models. Moreover, under the obvious assumptions regarding the 
semantics of causal models with probabilistic equations, formulas in the language described 
in Section 2.2.1 have the same probability of being true under both approaches. Consider 
Example 2.5.4 again. What is the probability of a formula like [ST ← 0](BS = 1) (if Suzy
hadn’t thrown, the bottle would have shattered)? Clearly, of the four deterministic models
described in the discussion of this example, the formula is true only in M1 and M3; thus, it
has probability .2. Although I have not given a semantics for formulas in this language in 
probabilistic causal models (i.e., models where the equations are probabilistic), if we assume 
that the outcomes “the bottle would have toppled over had only Suzy’s rock hit it” and “the 
bottle would have toppled over had only Billy’s rock hit it” are independent, as I assumed 
when assigning probabilities to the deterministic models, we would expect this statement to 
also be true in the probabilistic model. The converse holds as well: if a statement about the 
probability of causality holds in the probabilistic model, then it holds in the corresponding 
deterministic model. 


This means that (under the independence assumption) as far as the causal language of Sec- 
tion 2.2.1 is concerned, the two approaches are equivalent. Put another way, to distinguish the 
two approaches, we need a richer language. More generally, although we can debate whether 
pulling out the probability “really” gives an equivalent formulation of the situation, to the ex- 
tent that the notion of (probabilistic) causality is expressible in the language of Section 2.2.1, 
the two approaches are equivalent. In this sense, it is safe to reduce to deterministic models. 


This observation leads to two further points. Suppose that we do not want to make the 
relevant independence assumptions. Is there a natural way to augment probabilistic causal 
models to represent this information (beyond just going directly to a probability over deter- 
ministic causal models)? See Section 5.1 for further discussion of this point. 


Of course, we do not have to use probability to represent uncertainty; other representations 
of uncertainty could be used as well. Indeed, there is a benefit to using a representation 
that involves sets of probabilities. Since, in general, probabilistic structural equations do 
not determine the probability of the deterministic models that arise when we pull out the 
probability, all we have is a set of possible probabilities on the deterministic models. Given 
that, it seems reasonable to start with sets of probabilities on the equations as well. This
makes sense if, for example, we do not know the exact probability that Suzy’s rock would hit
the bottle, but we know that it would hit it with probability between .6 and .8. It seems that 
the same general approach of pulling out the probabilities works if we represent uncertainty 
using sets of probabilities, but I have not explored this approach in detail. 


2.6 Sufficient Causality 


Although counterfactual dependence is a key feature of causality, people’s causality judg- 
ments are clearly influenced by another quite different feature: how sensitive the causality 
ascription is to changes in various other factors. If we assume that Suzy is very accurate, then 
her throwing a rock is a robust cause of the bottle shattering. The bottle will shatter regardless 
of whether Billy throws, even if the bottle is in a slightly different position and even if it is 
slightly windy. On the other hand, suppose instead that we consider a long causal chain like 
the following: Suzy throws a rock at a lock, causing it to open, causing the lion that was in 
the locked cage to escape, frightening the cat, which leapt up on to the table and knocked over 
the bottle, which then shattered. Although Suzy’s throw is still a but-for cause of the bottle 
shattering, the shattering is clearly sensitive to many other factors. If Suzy’s throw had not 
broken the lock, if the lion had run in a different direction, or if the cat had not jumped on 
the table, then the bottle would not have shattered. People seem inclined to assign less blame 
to Suzy’s throw in this case. I return to the issue of long causal chains in Section 3.4.4 and
provide a solution to it using the notion of blame in Section 6.2. Here I define a notion of suf- 
ficient causality that captures some of the intuitions behind insensitive causation. Sufficient 
causality also turns out to be related to the notion of explanation that I define in Chapter 7, so 
it is worth considering in its own right, although it will not be a focus of this book. 

The key intuition behind the definition of sufficient causality is that not only does $\vec{X} = \vec{x}$
suffice to bring about $\varphi$ in the actual context (which is the intuition that AC2(b$^o$) and AC2(b$^u$)
are trying to capture), but it also brings it about in other “nearby” contexts. Since the frame- 
work does not provide a metric on contexts, there is no obvious way to define nearby context. 
Thus, in the formal definition below, I start by considering all contexts. Since conjunction 
plays a somewhat different role in this definition than it does in the definition of actual causal- 
ity (particularly in the modified HP definition), I take seriously here the intuition that what we 
really care about are parts of causes, rather than causes. 

Again, there are three variants of the notion of sufficient causality, depending on which 
version of AC2 we use. The differences between these variants do not play a role in our 
discussion, so I just write AC2 below. Of course, if we want to distinguish, AC2 below can 
be replaced by either AC2(a) and AC2(b$^o$), AC2(a) and AC2(b$^u$), or AC2(a$^m$), depending on
which version of the HP definition is being considered. 


Definition 2.6.1 $\vec{X} = \vec{x}$ is a sufficient cause of $\varphi$ in the causal setting $(M, \vec{u})$ if the following
conditions hold:


SC1. $(M, \vec{u}) \models (\vec{X} = \vec{x})$ and $(M, \vec{u}) \models \varphi$.


SC2. Some conjunct of $\vec{X} = \vec{x}$ is part of a cause of $\varphi$ in $(M, \vec{u})$. More precisely, there exists
a conjunct $X = x$ of $\vec{X} = \vec{x}$ and another (possibly empty) conjunction $\vec{Y} = \vec{y}$ such
that $X = x \wedge \vec{Y} = \vec{y}$ is a cause of $\varphi$ in $(M, \vec{u})$; that is, AC1, AC2, and AC3 hold for
$X = x \wedge \vec{Y} = \vec{y}$.


SC3. $(M, \vec{u}') \models [\vec{X} \leftarrow \vec{x}]\varphi$ for all contexts $\vec{u}'$.


SC4. $\vec{X}$ is minimal; there is no strict subset $\vec{X}'$ of $\vec{X}$ such that $\vec{X}' = \vec{x}'$ satisfies conditions
SC1, SC2, and SC3, where $\vec{x}'$ is the restriction of $\vec{x}$ to the variables in $\vec{X}'$. ■


SC3 is the key condition here; it says that $\vec{X} = \vec{x}$ suffices to bring about $\varphi$ in all contexts.
Thus, in the case of the 11–0 victory, any set of six voters becomes a sufficient cause (no matter
which variant of the HP definition we use), assuming that there are contexts corresponding
to all possible voting configurations. This suggests that sufficient causality is related to the 
modified HP definition, which, in this case, would also take any subset of six voters to be a 
cause. However, this is misleading, as the next example shows. 

Consider the forest-fire example again. In the disjunctive model, each of L = 1 and 
MD = 1 is a sufficient cause; in the conjunctive model, L = 1 ∧ MD = 1 is a sufficient
cause, assuming that there is a context where there is no lightning and another where the
arsonist does not drop a match. This is the case for all variants of the HP definition. Recall
that this is just the opposite of what the modified HP definition does with actual causality; it
would declare L = 1 ∧ MD = 1 the cause in the disjunctive model and both L = 1 and
MD = 1 individually causes in the conjunctive model. 
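
The role of SC3 and SC4 in the forest-fire example can be illustrated with a small brute-force check. The sketch below is deliberately partial: it tests only whether setting some of L and MD to 1 forces FF = 1 in every context (SC3) and discards settings that have a smaller setting already satisfying SC3 (the minimality requirement of SC4). It does not check SC2, which requires the full definition of actual causality, and it assumes that there is one context for each combination of lightning and dropped match. The function names are illustrative.

```python
from itertools import combinations, product

CONTEXTS = list(product((0, 1), repeat=2))  # context = (lightning, match)

def forest_fire(kind):
    """Equation for FF in the 'disjunctive' or 'conjunctive' model."""
    if kind == "disjunctive":
        return lambda lightning, match: int(lightning or match)
    return lambda lightning, match: int(lightning and match)

def sc3_holds(ff, setting):
    """SC3: [setting](FF = 1) holds in every context."""
    for lightning, match in CONTEXTS:
        lightning = setting.get("L", lightning)  # intervention overrides context
        match = setting.get("MD", match)
        if ff(lightning, match) != 1:
            return False
    return True

def minimal_sc3_settings(ff):
    """Sets of variables whose setting to 1 satisfies SC3, keeping minimal ones."""
    found = []
    for size in (1, 2):
        for subset in combinations(("L", "MD"), size):
            has_smaller = any(set(prev) < set(subset) for prev in found)
            if not has_smaller and sc3_holds(ff, dict.fromkeys(subset, 1)):
                found.append(subset)
    return found

print(minimal_sc3_settings(forest_fire("disjunctive")))  # [('L',), ('MD',)]
print(minimal_sc3_settings(forest_fire("conjunctive")))  # [('L', 'MD')]
```

In the disjunctive model each variable alone passes the check, whereas in the conjunctive model only the pair does, which matches the verdicts described above.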

I wrote SC2 as I did so as to take seriously the notion that all we really care about when 
using the modified (and updated) definition is parts of causes. Thus, in the disjunctive model 
for the forest fire example, L = 1 and MD = 1 are both sufficient causes of the forest 
fire in the context (1,1) where there is both a lightning strike and a dropped match, even 
if we use AC2(a$^m$) in SC2. By assumption, SC3 holds: both [L ← 1](FF = 1) and
[MD ← 1](FF = 1) hold in all contexts. Moreover, both L = 1 and MD = 1 are parts
of a cause of FF = 1 (namely, L = 1 ∧ MD = 1), so SC2 holds.

The forest-fire example shows that sufficient causality lets us distinguish what can be called 
joint causes from independent causes. In the disjunctive forest-fire model, the lightning and 
the dropped match can each be viewed as independent causes of the fire; each suffices to 
bring it about. In the conjunctive model, the lightning and the dropped match are joint causes; 
their joint action is needed to bring about the forest fire. The distinction between joint and 
independent causality seems to be one that people are quite sensitive to. Not surprisingly, it 
plays a role in legal judgments. 

Going on with examples, consider the sophisticated rock-throwing model $M'_{RT}$ from
Example 2.3.3 again. Assume that Suzy is always accurate; that is, she is accurate in all contexts
where she throws, and would be accurate even in contexts where she doesn’t throw if
(counterfactually) she were to throw; that is, assume that [ST ← 1](BS = 1) holds in all contexts,
even if Suzy does not actually throw. Then Suzy’s throw is a sufficient cause of the bottle
shattering in the context that we have been considering, where both Billy and Suzy throw,
and Suzy’s throw actually hits the bottle. Billy’s throw is not a sufficient cause in this context
because it is not even (part of) a cause. However, if we consider the model $M''_{RT}$, which has a
larger set of contexts, including ones where Suzy might throw and miss, then ST = 1 is not a
sufficient cause of BS = 1 in the context u* where Suzy is accurate and her rock hits before
Billy’s, but SH = 1 is. Recall that ST = 1 remains a cause of BS = 1 in ($M''_{RT}$, u*). This
shows that (part of) a cause is not necessarily part of a sufficient cause.


In the double-prevention example (Example 2.3.4), assuming that Suzy is accurate, 
SBT = 1 (Suzy bombing the target) is a sufficient cause of TD = 1 (the target being 
destroyed). In contrast, Billy pulling the trigger and shooting down Enemy is not, if there are 
contexts where Suzy does not go up, and thus does not destroy the target. More generally, a 
but-for cause at the beginning of a long causal chain is unlikely to be a sufficient cause. 


The definition of actual causality (Definition 2.2.1) focuses on the actual context; the set 
of possible contexts plays no role. By way of contrast, the set of contexts plays a major 
role in the definition of sufficient causality. The use of SC3 makes sufficient causality quite 
sensitive to the choice of the set of possible contexts. It also may make it an unreasonably 
strong requirement in some cases. Do we want to require SC3 to hold even in some extremely 
unlikely contexts? A sensible way to weaken SC3 would be to add probability to the picture, 
especially if we have a probability on contexts, as in Section 2.5. Rather than requiring SC3 to 
hold for all contexts, we could then consider the probability of the set of contexts for which it 
holds. That is, we can take $\vec{X} = \vec{x}$ to be a sufficient cause of $\varphi$ with probability $\alpha$ in $(M, \vec{u})$ if
SC1, SC2, and SC4 hold and the set of contexts for which SC3 holds has probability at least $\alpha$.


Notice that with this change, there is a tradeoff between minimality and probability of 
sufficiency. Consider Suzy throwing rocks. Suppose that, although Suzy is quite accurate, 
there is a (quite unlikely) context where there are high winds, so Suzy misses the bottle. In 
this model, a sufficient cause for the bottle shattering is “Suzy throws and there are no high 
winds”. Suzy throwing is not by itself sufficient. But it is sufficient in all contexts but the one 
in which there are high winds. If the context with high winds has probability .1, then Suzy’s 
throw is a sufficient cause of the bottle shattering with probability .9. 
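
The probabilistic weakening of SC3 amounts to summing the probability of the contexts in which the intervention brings about the outcome. The following minimal sketch does this for the high-winds story, assuming just two contexts with the probabilities given above; the context names and the equation for the bottle shattering are illustrative assumptions.

```python
from fractions import Fraction

# Hypothetical contexts and their probabilities.
contexts = {
    "calm": Fraction(9, 10),
    "high_winds": Fraction(1, 10),
}

def bottle_shatters(context, suzy_throws):
    """Suzy's rock hits, and the bottle shatters, unless there are high winds."""
    return suzy_throws and context != "high_winds"

def probability_of_sufficiency(suzy_throws=True):
    """Probability of the set of contexts where [ST <- 1](BS = 1) holds."""
    return sum(
        p for context, p in contexts.items()
        if bottle_shatters(context, suzy_throws)
    )

print(probability_of_sufficiency())  # 9/10
```

Adding "there are no high winds" to the candidate cause would raise this probability to 1 at the cost of a less minimal cause, which is the tradeoff just described.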


This tradeoff between minimality and probability of sufficiency comes up again in the 
context of the definition of explanation, considered in Section 7.1, so I do not dwell on it 
here. But thinking in terms of probability provides some insight into whether not performing 
an action should be treated as a sufficient cause. Consider Example 2.3.5, where Billy is 
hospitalized. Besides Billy’s doctor, there are other doctors at the hospital who could, in 
principle, treat Billy. But they are unlikely to do so. Indeed, they will not even check up on 
Billy unless they are told to; they presumably have other priorities. 


Now consider a context where in fact no doctor treats Billy. In that case, Billy’s doctor 
not treating Billy on Monday is a sufficient cause for Billy feeling sick on Tuesday with high 
probability because, with high probability, no other doctor would treat him either. However, 
another doctor not treating Billy has only a low probability of being a sufficient cause for 
Billy feeling sick on Tuesday because, with high probability, Billy’s doctor does treat him. 
This intuition is similar in spirit to the way normality considerations are brought to bear on 
this example in Chapter 3 (see Example 3.2.2). 


For the most part, I do not focus on sufficient causality in this book. I think that notions 
such as normality, blame, and explanation capture many of the same intuitions as sufficient 
causality in an arguably better way. However, it is worth keeping sufficient causality in mind 
when we consider these other notions.


2.7 Causality in Nonrecursive Models


Up to now, I have considered causality only in recursive (i.e., acyclic) causal models. Recur- 
sive models are also my focus throughout the rest of the book. However, there are examples 
that arguably involve nonrecursive models. For example, the connection between pressure 
and volume is given by Boyle’s Law, which says that PV = k: the product of pressure and 
volume is a constant (if the temperature remains constant). Thus, increasing the pressure de- 
creases the volume, and decreasing the volume increases the pressure. There is no direction 
of causality in this equation. However, in any particular setting, we are typically manipulating 
either the pressure or the volume, so there is (in that setting) a “direction” of causality. 

Perhaps a more interesting example is one of collusion. Imagine two arsonists who agree 
to burn down the forest, when only one match suffices to burn down the forest. (So we are 
essentially in the disjunctive model.) However, the arsonists make it clear to each other that 
each will not drop a lighted match unless the other does. At the critical moment, they look 
in each others’ eyes, and both drop their lit matches, resulting in the forest burning down. In 
this setting, it is plausible that each arsonist caused the other to throw the match, and together 
they caused the forest to burn down. That is, if MDi stands for “arsonist i drops a lit match”,
for i = 1, 2, and U is an exogenous variable that determines whether the arsonists initially
intend to drop the match, we can take the equation for MD1 to be such that MD1 = 1 if and
only if U = 1 and MD2 = 1, and similarly for MD2. This model is not recursive; in all
contexts, MD1 depends on MD2 and MD2 depends on MD1. In this section, I show how the
HP definition(s) can be extended to deal with such nonrecursive models. 

In nonrecursive models, there may be more than one solution to an equation in a given 
context, or there may be none. In particular, this means that a context no longer necessarily 
determines the values of the endogenous variables. Earlier, I identified a primitive event 
such as X = x with the basic causal formula [](X = x), that is, a formula of the form
$[Y_1 \leftarrow y_1, \ldots, Y_k \leftarrow y_k](X = x)$ with k = 0. In recursive causal models, where there is
a unique solution to the equations in a given context, we can define [](X = x) to be true in
$(M, \vec{u})$ if X = x in the unique solution to the equations in context $\vec{u}$. It seems reasonable to
identify [](X = x) with X = x in this case. But it is not so reasonable if there may be several
solutions to the equations or none. 
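
The collusion model gives a concrete instance of a context with more than one solution. The sketch below enumerates all assignments to the endogenous variables and keeps those that satisfy every equation. It assumes the equations for MD1 and MD2 described above, together with a disjunctive equation for the fire (FF = 1 iff at least one match is dropped), which is an assumption based on the remark that one match suffices; the function names are illustrative.

```python
from itertools import product

def equations(u, md1, md2, ff):
    """Right-hand sides of the structural equations, given candidate values."""
    return (
        int(u == 1 and md2 == 1),   # MD1 = 1 iff U = 1 and MD2 = 1
        int(u == 1 and md1 == 1),   # MD2 = 1 iff U = 1 and MD1 = 1
        int(md1 == 1 or md2 == 1),  # FF = 1 iff at least one match is dropped
    )

def solutions(u):
    """All assignments (MD1, MD2, FF) satisfying every equation in context u."""
    return [
        v for v in product((0, 1), repeat=3)
        if equations(u, *v) == v
    ]

print(solutions(0))  # [(0, 0, 0)]
print(solutions(1))  # [(0, 0, 0), (1, 1, 1)]
```

In the context U = 1 both "neither arsonist drops a match" and "both drop their matches" are solutions, so the context alone does not determine the values of the endogenous variables.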

What we really want to do is to be able to say that X = x under a particular setting of the
variables. Thus, we now take the truth of a primitive event such as X = x to be relative not
just to a context but to a complete assignment: a complete description $(\vec{u}, \vec{v})$ of the values of
both the exogenous and the endogenous variables. As before, the context $\vec{u}$ assigns a value
to all the exogenous variables; $\vec{v}$ assigns a value to all the endogenous variables. Now define
$(M, \vec{u}, \vec{v}) \models X = x$ if X has value x in $\vec{v}$. Since the truth of X = x depends on just $\vec{v}$, not $\vec{u}$,
I sometimes write $(M, \vec{v}) \models X = x$.

This definition can be extended to Boolean combinations of primitive events in the standard
way. Define $(M, \vec{u}, \vec{v}) \models [\vec{Y} \leftarrow \vec{y}]\varphi$ iff $(M, \vec{v}') \models \varphi$ for all solutions $(\vec{u}, \vec{v}')$ to the equations
in $M_{\vec{Y} \leftarrow \vec{y}}$. Since the truth of $[\vec{Y} \leftarrow \vec{y}](X = x)$ depends only on the setting $\vec{u}$ of the exogenous
variables, and not on $\vec{v}$, I write $(M, \vec{u}) \models [\vec{Y} \leftarrow \vec{y}](X = x)$ to emphasize this.

With these definitions in hand, it is easy to extend the HP definition of causality to arbitrary
models. Causality is now defined with respect to a tuple $(M, \vec{u}, \vec{v})$ because we need to know
the values of the endogenous variables to determine the truth of some formulas. That is, we
now talk about $\vec{X} = \vec{x}$ being an actual cause of $\varphi$ in $(M, \vec{u}, \vec{v})$. As before, we have conditions
AC1–AC3. AC1 and AC3 remain unchanged. AC2 depends on whether we want to consider
the original, updated, or modified definition. For the updated definition, we have:


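AC2. There exists a partition $(\vec{Z}, \vec{W})$ of $\mathcal{V}$ with $\vec{X} \subseteq \vec{Z}$ and some setting $(\vec{x}', \vec{w})$ of the
variables in $(\vec{X}, \vec{W})$ such that

(a) $(M, \vec{u}) \models \neg[\vec{X} \leftarrow \vec{x}', \vec{W} \leftarrow \vec{w}]\varphi$.

(b$^u$) If $\vec{z}^*$ is such that $(M, \vec{u}, \vec{v}) \models \vec{Z} = \vec{z}^*$, then for all subsets $\vec{W}'$ of $\vec{W}$ and $\vec{Z}'$ of
$\vec{Z} - \vec{X}$, $(M, \vec{u}) \models [\vec{X} \leftarrow \vec{x}, \vec{W}' \leftarrow \vec{w}, \vec{Z}' \leftarrow \vec{z}^*]\varphi$.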


In recursive models, the formula $\neg[\vec{X} \leftarrow \vec{x}', \vec{W} \leftarrow \vec{w}]\varphi$ in the version of AC2(a) above is
equivalent to $[\vec{X} \leftarrow \vec{x}', \vec{W} \leftarrow \vec{w}]\neg\varphi$, the formula used in the original formulation of AC2(a).
Given a recursive model M, there is a unique solution to the equations in $M_{\vec{X} \leftarrow \vec{x}', \vec{W} \leftarrow \vec{w}}$;
$\neg\varphi$ holds in that solution iff $\varphi$ does not hold. However, with nonrecursive models, there may
be several solutions; AC2(a) holds if $\varphi$ does not hold in at least one of them. By way of contrast,
for AC2(b$^u$) to hold, $\varphi$ must stay true in all solutions to the equations in $M_{\vec{X} \leftarrow \vec{x}, \vec{W}' \leftarrow \vec{w}, \vec{Z}' \leftarrow \vec{z}^*}$
in context $\vec{u}$. AC2(a) already says that there is some contingency in which setting $\vec{X}$ to $\vec{x}$
is necessary to bring about $\varphi$ (since setting $\vec{X}$ to something other than $\vec{x}$ in that contingency
may result in $\varphi$ no longer holding). Requiring $\neg\varphi$ to hold in only one solution to the equations
seems in the spirit of the necessity requirement. However, AC2(b$^u$) says that setting $\vec{X}$ to $\vec{x}$ is
sufficient to bring about $\varphi$ in all relevant settings, so it makes sense that we want $\varphi$ to hold in
all solutions. That’s part of the intuition for sufficiency. Clearly, in the recursive case, where
there is only one solution, this definition agrees with the definition given earlier.

Analogous changes are required to extend the original and modified HP definition to the
nonrecursive case. For AC2(b^o), we require only that for all subsets $\vec Z'$ of $\vec Z - \vec X$, we have
$(M, \vec u) \models [\vec X \gets \vec x, \vec W \gets \vec w, \vec Z' \gets \vec z^*]\varphi$, and do not require this to hold for all subsets $\vec W'$ of
$\vec W$; for AC2(a^m), we require that $\vec w$ consist of the values of the variables in $\vec W$ in the actual
context. Again, these definitions agree with the earlier definitions in the recursive case.

Note that, with these definitions, all definitions agree that $MD_1 = 1$ and $MD_2 = 1$ are
causes of the fire in the collusive disjunctive scenario discussed above. This is perhaps most
interesting in the case of the modified HP definition, where in the original disjunctive scenario,
the cause was the conjunction $MD = 1 \land L = 1$. However, in this case, each arsonist is a
but-for cause of the fire. Although it takes only one match to start the fire, if arsonist 1 does
not drop his match, then arsonist 2 won’t drop his match either, so there will not be a fire. The
key point is that the arsonists do not act independently here, whereas in the original story, the
arsonist and lightning were viewed as independent.
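
To make the role of multiple solutions concrete, here is a minimal Python sketch of a collusive model. The equations used (MD1 = U and MD2, MD2 = U and MD1, FF = MD1 or MD2) are an assumption chosen for illustration, not necessarily the exact equations of the scenario above; the function name `solutions` is likewise hypothetical. The sketch enumerates every assignment that satisfies the equations, so a context can have more than one solution.

```python
from itertools import product

def solutions(u, do=None):
    """List all solutions of the assumed collusive equations in context u.

    Assumed (illustrative) equations:
        MD1 = U and MD2,  MD2 = U and MD1,  FF = MD1 or MD2.
    `do` maps intervened variables to their forced values.
    """
    do = do or {}
    found = []
    for md1, md2, ff in product((0, 1), repeat=3):
        v = {"MD1": md1, "MD2": md2, "FF": ff}
        rhs = {
            "MD1": int(u and v["MD2"]),
            "MD2": int(u and v["MD1"]),
            "FF": int(v["MD1"] or v["MD2"]),
        }
        # A candidate is a solution if every variable equals its forced
        # value (when intervened on) or the value its equation dictates.
        if all(v[x] == do.get(x, rhs[x]) for x in v):
            found.append(v)
    return found

# Two solutions in context U = 1: both arsonists drop, or neither does.
both_or_neither = solutions(1)
# Forcing MD1 = 0 leaves a single solution, and it has no fire.
no_fire = solutions(1, {"MD1": 0})
```

Under these assumed equations the context alone does not fix the actual world, which is why causality is relativized to $(M, \vec u, \vec v)$; and in every solution where arsonist 1 is prevented from dropping his match there is no fire, matching the but-for reading in the text.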

Interestingly, we can make even the collusion story recursive. Suppose that we add a
variable $EC$ for eye contact (allowing for the possibility that the arsonists avoid each other’s
eyes), where the equations are such that each drops the match only if they do indeed have
eye contact. Now we are back at a model where each arsonist is (separately) a cause of the
fire (as is the eye contact) according to the original and updated HP definition, and
$MD_1 = 1 \land MD_2 = 1$ is a cause according to the modified HP definition. We can debate which is the
“better” model here. But I have no reason to believe that we can always replace a nonrecursive
model by a recursive model and still describe essentially the same situation.

Finally, it is worth noting that causality is not necessarily asymmetric in nonrecursive
models; we can have $X = x$ being a cause of $Y = y$ and $Y = y$ being a cause of $X = x$. For
example, in the model of the collusive story above (without the variable $EC$), $MD_1 = 1$ is a
cause of $MD_2 = 1$ and $MD_2 = 1$ is a cause of $MD_1 = 1$. It should not be so surprising that
we can have such symmetric causality relations in nonrecursive models. However, it is easy
to see that causality is asymmetric in recursive models, according to all the variants of the HP
definition. AC2(a) and AC2(a^m) guarantee that if $X = x$ is a cause of $Y = y$ in $(M, \vec u)$, then
$Y$ depends on $X$ in context $\vec u$. Similarly, $X$ must depend on $Y$ if $Y = y$ is a cause of $X = x$
in $(M, \vec u)$. This cannot be the case in a recursive model.

Although the definition of causality in nonrecursive models given here seems like the most 
natural generalization of the definition for recursive models, I do not have convincing exam- 
ples suggesting that this is the “right” definition, nor do I have examples showing that this 
definition is problematic in some ways. This is because all the standard examples are most 
naturally modeled using recursive models. It would be useful to have more examples of sce- 
narios that are best captured by nonrecursive models for which we have reasonable intuitions 
regarding the ascription of causality. 


2.8 AC2(b^o) vs. AC2(b^u)


In this section, I consider more carefully AC2(b^o) and AC2(b^u), to give the reader a sense of
the subtle differences between the original and updated HP definitions. I start with the example
that originally motivated replacing AC2(b^o) by AC2(b^u).


Example 2.8.1 Suppose that a prisoner dies either if A loads B’s gun and B shoots or if
C loads and shoots his gun. Taking D to represent the prisoner’s death and making the
obvious assumptions about the meaning of the variables, we have that $D = (A \land B) \lor C$.
(Note that here I am identifying the binary variables A, B, C, and D with primitive
propositions in propositional logic, as I said earlier I would. I could have also written this as
D = max(min(A, B), C).) Suppose that in the actual context u, A loads B’s gun, B
does not shoot, but C does load and shoot his gun, so the prisoner dies. That is, A = 1,
B = 0, and C = 1. Clearly C = 1 is a cause of D = 1. We would not want
to say that A = 1 is a cause of D = 1, given that B did not shoot (i.e., given that
B = 0). However, suppose that we take the obvious model with the variables A, B, C,
D. With AC2(b^o), A = 1 is a cause of D = 1. For we can take $\vec W = \{B, C\}$ and
consider the contingency where B = 1 and C = 0. It is easy to check that AC2(a)
and AC2(b^o) hold for this contingency, since $(M, u) \models [A \gets 0, B \gets 1, C \gets 0](D = 0)$,
whereas $(M, u) \models [A \gets 1, B \gets 1, C \gets 0](D = 1)$. Thus, according to the original HP
definition, A = 1 is a cause of D = 1. However, AC2(b^u) fails in this case because
$(M, u) \models [A \gets 1, C \gets 0](D = 0)$. The key point is that AC2(b^u) says that for A = 1
to be a cause of D = 1, it must be the case that D = 1 even if only some of the values in $\vec W$
are set to their values $\vec w$. In this case, by setting only A to 1 and leaving B unset, B takes on
its original value of 0, in which case D = 0. AC2(b^o) does not consider this case.

Note that the modified HP definition also does not call A = 1 a cause of D = 1. This
follows from Theorem 2.2.3 and the observations above; it can also be seen easily by observing
that if the values of any subset of variables are fixed at their actual values in context u, setting
A = 0 will not make D = 0.
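
The three interventions used in this example can be checked mechanically. The following Python sketch is an illustration only: the helper name `solve` and the encoding of the context as fixed default values for A, B, and C are assumptions of the sketch, not part of the formal framework.

```python
def solve(do=None, context=(1, 0, 1)):
    """Solve the prisoner model D = (A and B) or C under intervention `do`.

    `context` gives the values that the exogenous variables assign to
    A, B, and C in the actual world (A = 1, B = 0, C = 1).
    """
    do = do or {}
    a, b, c = context
    v = {"A": do.get("A", a), "B": do.get("B", b), "C": do.get("C", c)}
    v["D"] = do.get("D", int((v["A"] and v["B"]) or v["C"]))
    return v

# AC2(a) with W = {B, C}, w = (1, 0): flipping A to 0 gives D = 0.
ac2a = solve({"A": 0, "B": 1, "C": 0})["D"] == 0
# AC2(b^o): with all of W set to w, A = 1 gives D = 1.
ac2b_original = solve({"A": 1, "B": 1, "C": 0})["D"] == 1
# AC2(b^u) also considers the subset W' = {C}; B reverts to 0 and D = 0.
ac2b_updated_fails = solve({"A": 1, "C": 0})["D"] == 0
# Modified definition: with C held at its actual value 1, A = 0 never
# makes D = 0, whatever is done with B.
modified_not_cause = all(
    solve({"A": 0, "C": 1, **extra})["D"] == 1 for extra in ({}, {"B": 0})
)
```

All four flags evaluate to true, which mirrors the text: the contingency satisfies AC2(a) and AC2(b^o), fails AC2(b^u), and no setting that keeps variables at their actual values makes D counterfactually depend on A.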

Although it seems clear that we do not want A = 1 to be a cause of D = 1, the situation
for the original HP definition is not as bleak as it may appear. We can deal with this example
using the original HP definition if we add extra variables, as we did with the Billy-Suzy
rock-throwing example. Specifically, suppose that we add a variable B′ for “B shoots a loaded
gun”, where $B' = A \land B$ and $D = B' \lor C$. This slight change prevents A = 1 from being
a cause of D = 1, even according to the original HP definition. I leave it to the reader to
check that any attempt to now declare A = 1 a cause of D = 1 according to the original
HP definition would have to put B′ into $\vec Z$. However, because B′ = 0 in the actual world,
AC2(b^o) does not hold. As I show in Section 4.3, this approach to dealing with the problem
generalizes. There is a sense in which, by adding enough extra variables, we can always use
AC2(b^o) instead of AC2(b^u) to get equivalent judgments of causality. ∎
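
The check left to the reader can be done by brute force. The sketch below is a hypothetical illustration for binary variables and a single candidate cause: it searches every partition and every setting of $\vec W$ for a witness satisfying AC2(a) and AC2(b^o). The names `is_cause_original`, `MODEL4`, `MODEL5`, and `Bp` (standing for B′) are inventions of the sketch, and the actual context is folded into constant equations for A, B, and C.

```python
from itertools import chain, combinations, product

def subsets(xs):
    """All subsets of xs, as tuples."""
    xs = list(xs)
    return chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))

def solve(eqs, order, do):
    """Solve a recursive model by evaluating equations in causal order."""
    v = {}
    for name in order:
        v[name] = do[name] if name in do else int(eqs[name](v))
    return v

def is_cause_original(eqs, order, x, phi):
    """Search for a witness to AC2(a) and AC2(b^o) for binary variable x."""
    actual = solve(eqs, order, {})
    if not phi(actual):                      # AC1 fails
        return False
    others = [n for n in order if n != x]
    for w_vars in subsets(others):
        z_rest = [n for n in others if n not in w_vars]   # Z - X
        for w_vals in product((0, 1), repeat=len(w_vars)):
            w = dict(zip(w_vars, w_vals))
            # AC2(a): flipping x under the contingency must falsify phi.
            if phi(solve(eqs, order, {**w, x: 1 - actual[x]})):
                continue
            # AC2(b^o): phi must hold with x at its actual value, for every
            # subset Z' of Z - X frozen at its actual values.
            if all(
                phi(solve(eqs, order,
                          {**w, x: actual[x], **{z: actual[z] for z in zs}}))
                for zs in subsets(z_rest)
            ):
                return True
    return False

CONTEXT = {"A": lambda v: 1, "B": lambda v: 0, "C": lambda v: 1}
MODEL4 = {**CONTEXT, "D": lambda v: (v["A"] and v["B"]) or v["C"]}
MODEL5 = {**CONTEXT, "Bp": lambda v: v["A"] and v["B"],
          "D": lambda v: v["Bp"] or v["C"]}
ORDER4 = ["A", "B", "C", "D"]
ORDER5 = ["A", "B", "C", "Bp", "D"]

def dies(v):
    return v["D"] == 1
```

With these definitions, `is_cause_original(MODEL4, ORDER4, "A", dies)` finds the witness discussed in the example, whereas the same query on `MODEL5` finds none; C = 1 remains a cause in both models. The search is exponential in the number of variables, so it is only meant for toy models like this one.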


There are advantages to using AC2(b^o). For one thing, it follows from Theorem 2.2.3(d)
that with AC2(b^o), causes are always single conjuncts. This is not the case with AC2(b^u), as
the following example shows.


Example 2.8.2 A votes for a candidate. A’s vote is recorded in two optical scanners B and
C. D collects the output of the scanners; D′ records whether just scanner B records a vote for
the candidate. The candidate wins (i.e., WIN = 1) if any of A, D, or D′ is 1. The value of A
is determined by the exogenous variable. The following structural equations characterize the
value of the remaining variables:

• $B = A$;
• $C = A$;
• $D = B \land C$;
• $D' = B \land \neg A$; and
• $WIN = A \lor D \lor D'$.

Call this causal model $M_V$. Roughly speaking, D′ acts like BH in the rock-throwing example
as modeled by $M'_{RT}$. The causal network for $M_V$ is shown in Figure 2.9.

In the actual context u, A = 1, so B = C = D = WIN = 1 and D′ = 0. I claim
that $B = 1 \land C = 1$ is a cause of WIN = 1 in $(M_V, u)$ according to the updated HP
definition (which means that, by AC3, neither B = 1 nor C = 1 is a cause of WIN = 1 in
$(M_V, u)$). To see this, first observe that AC1 clearly holds. For AC2, consider the witness
where $\vec W = \{A\}$ (so $\vec Z = \{B, C, D, D', WIN\}$) and w = 0 (so we are considering the
contingency where A = 0). Clearly, $(M_V, u) \models [A \gets 0, B \gets 0, C \gets 0](WIN = 0)$, so
AC2(a) holds, and $(M_V, u) \models [A \gets 0, B \gets 1, C \gets 1](WIN = 1)$. Moreover, $(M_V, u) \models
[B \gets 1, C \gets 1](WIN = 1)$, and WIN = 1 continues to hold even if D is set to 1 and/or D′
is set to 0 (their values in $(M_V, u)$). Thus, AC2 holds.

Figure 2.9: Another voting scenario.


It remains to show that AC3 holds and, in particular, that neither B = 1 nor C = 1 is
a cause of WIN = 1 in $(M_V, u)$ according to the updated HP definition. The argument is
essentially the same for both B = 1 and C = 1, so I just show it for B = 1. Roughly
speaking, B = 1 does not satisfy AC2(a) and AC2(b^u) for the same reason that BT = 1
does not satisfy it in the rock-throwing example. For suppose that B = 1 satisfies AC2(a)
and AC2(b^u). Then we would have to have $A \in \vec W$, and we would need to consider the
contingency where A = 0 (otherwise WIN = 1 no matter how we set B). Now we need
to consider two cases: $D' \in \vec W$ and $D' \in \vec Z$. If $D' \in \vec W$, then if we set D′ = 0, we have
$(M_V, u) \models [A \gets 0, B \gets 1, D' \gets 0](WIN = 0)$, so AC2(b^u) fails (no matter whether
C and D are in $\vec W$ or $\vec Z$). And if we set D′ = 1, then AC2(a) fails, since $(M_V, u) \models
[A \gets 0, B \gets 0, D' \gets 1](WIN = 1)$. Now if $D' \in \vec Z$, note that $(M_V, u) \models D' = 0$. Since
$(M_V, u) \models [A \gets 0, B \gets 1, D' \gets 0](WIN = 0)$, again AC2(b^u) fails (no matter whether
C or D are in $\vec W$ or $\vec Z$). Thus, B = 1 is not a cause of WIN = 1 in $(M_V, u)$ according to the
updated HP definition.

By way of contrast, B = 1 and C = 1 are causes of WIN = 1 in $(M_V, u)$ according to
the original HP definition (as we would expect from Theorem 2.2.3). Although B = 1 and
C = 1 do not satisfy AC2(b^u), they do satisfy AC2(b^o). To see that B = 1 satisfies AC2(b^o),
take $\vec Z = \{B, D, WIN\}$ and consider the witness where A = 0, C = 1, and D′ = 0. Clearly
we have both that $(M_V, u) \models [B \gets 0, A \gets 0, C \gets 1, D' \gets 0](WIN = 0)$ and that
$(M_V, u) \models [B \gets 1, A \gets 0, C \gets 1, D' \gets 0](WIN = 1)$, and this continues to hold if D is
set to 1 (its original value). A similar argument shows that C = 1 is a cause.

Interestingly, none of B = 1, C = 1, or $B = 1 \land C = 1$ is a cause of WIN = 1 in $(M_V, u)$
according to the modified HP definition. This follows from Theorem 2.2.3; if B = 1 were
part of a cause according to the modified HP definition, then it would be a cause according
to the updated HP definition, and similarly for C = 1. It can also be seen directly: setting
B = C = 0 has no effect on WIN if A = 1. The only cause of WIN = 1 in $(M_V, u)$
according to the modified HP definition is A = 1 (which, of course, is also a cause according
to the original and updated definitions). ∎
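
The interventions cited in Example 2.8.2 can be replayed with a short Python sketch. The helper name `solve` and the key `Dp` (standing for D′) are assumptions of the sketch; the equations are the five listed in the example, with A = 1 in the actual context.

```python
def solve(do=None, a_context=1):
    """Solve the voting model M_V under the intervention `do`."""
    do = do or {}
    v = {}
    v["A"] = do.get("A", a_context)
    v["B"] = do.get("B", v["A"])
    v["C"] = do.get("C", v["A"])
    v["D"] = do.get("D", int(v["B"] and v["C"]))
    v["Dp"] = do.get("Dp", int(v["B"] and not v["A"]))
    v["WIN"] = do.get("WIN", int(v["A"] or v["D"] or v["Dp"]))
    return v

def win(do):
    return solve(do)["WIN"]

# Witness for B = 1 and C = 1 jointly (updated definition), W = {A}, w = 0.
joint_ac2a = win({"A": 0, "B": 0, "C": 0}) == 0
joint_ac2b = win({"A": 0, "B": 1, "C": 1}) == 1 and win({"B": 1, "C": 1}) == 1
# B = 1 alone fails AC2(b^u) once D' is held at 0 ...
b_alone_fails_b = win({"A": 0, "B": 1, "Dp": 0}) == 0
# ... and fails AC2(a) if D' is set to 1 instead.
b_alone_fails_a = win({"A": 0, "B": 0, "Dp": 1}) == 1
# Original definition: witness A = 0, C = 1, D' = 0 works for B = 1.
b_original = (win({"B": 0, "A": 0, "C": 1, "Dp": 0}) == 0
              and win({"B": 1, "A": 0, "C": 1, "Dp": 0}) == 1)
# Modified definition: with A at its actual value, B and C do not matter.
modified_no_effect = win({"B": 0, "C": 0}) == 1
```

Each flag is true under the stated equations, so the sketch agrees with the case analysis above. It only replays the specific interventions named in the text and does not search over all witnesses.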


The fact that causes may not be singletons has implications for the difficulty of determining
whether A is an actual cause of B (see Section 5.3). Although this may suggest that we should
then use the original HP definition, adding extra variables if needed, this conclusion is not
necessarily warranted either. For one thing, it may not be obvious when modeling a problem
that extra variables are needed. Moreover, the advantages of lower complexity may be lost
if sufficiently many additional variables need to be added. The modified HP definition has
the benefits of both typically giving the “right” answer without the need for extra variables
(but see Section 4.3), while having lower complexity than either the original or the updated HP
definition (again, see Section 5.3).

I conclude this section with one more example that I hope clarifies some of the details of
AC2(b^u) (and AC2(b^o)). Recall that AC2(b^u) requires that $\varphi$ remain true if $\vec X$ is set to $\vec x$,
even if only a subset of the variables in $\vec W$ are set to their values in $\vec w$ and all the variables
in an arbitrary subset $\vec Z'$ of $\vec Z$ are set to their original values $\vec z^*$ (i.e., the values they had in
the original context, where $\vec X = \vec x$). Since I have viewed the variables in $\vec Z$ as being on the
causal path to $\varphi$ (although that is not quite right; see Section 2.9), we might hope that not
only is setting $\vec X$ to $\vec x$ sufficient to make $\varphi$ true, but it is also enough to force all the variables
in $\vec Z$ to their original values. Indeed, there is a definition of actual causality that makes this
requirement (see the bibliographic notes). Formally, we might hope that if $(M, \vec u) \models \vec Z = \vec z^*$,
then the following holds:


$(M, \vec u) \models [\vec X \gets \vec x, \vec W' \gets \vec w](Z = z^*)$ for all $Z \in \vec Z$ and all $\vec W' \subseteq \vec W$.    (2.1)


It is not hard to show (see Lemma 2.10.2) that, in the presence of (2.1), AC2(b^u) can be
simplified to


$(M, \vec u) \models [\vec X \gets \vec x, \vec W' \gets \vec w]\varphi$;


there is no need to include the subsets $\vec Z'$. Although (2.1) holds in many examples, it seems
unreasonable to require it in general, as the following example shows.


Example 2.8.3 Imagine that a vote takes place. For simplicity, two people vote. The measure
is passed if at least one of them votes in favor. In fact, both of them vote in favor, and the
measure passes. This version of the story is almost identical to the situation where either a
match or a lightning strike suffices to bring about a forest fire. If we use $V_1$ and $V_2$ to denote
how the voters vote ($V_i = 0$ if voter $i$ votes against and $V_i = 1$ if she votes in favor) and P
to denote whether the measure passes (P = 1 if it passes, P = 0 if it doesn’t), then in the
context where $V_1 = V_2 = 1$, it is easy to see that each of $V_1 = 1$ and $V_2 = 1$ is part of a cause
of P = 1 according to the original and updated HP definition. However, suppose that the
story is modified slightly. Now assume that there is a voting machine that tabulates the votes.
Let T represent the total number of votes recorded by the machine. Clearly, $T = V_1 + V_2$ and
P = 1 iff $T \ge 1$. The causal network in Figure 2.10 represents this more refined version of
the story.

In this more refined scenario, $V_1 = 1$ and $V_2 = 1$ are still both causes of P = 1 according
to the original and updated HP definition. Consider $V_1 = 1$. Take $\vec Z = \{V_1, T, P\}$ and
$\vec W = \{V_2\}$, and consider the contingency $V_2 = 0$. With this witness, P is counterfactually
dependent on $V_1$, so AC2(a) holds. To check that this contingency satisfies AC2(b^u) (and
hence also AC2(b^o)), note that setting $V_1$ to 1 and $V_2$ to 0 results in P = 1, even if T is set to
2 (its current value). However, (2.1) does not hold here: T does not retain its original value of
2 when $V_1 = 1$ and $V_2 = 0$.

Figure 2.10: A more refined voting scenario.


Since, in general, one can always imagine that a change in one variable produces some
feeble change in another, it seems unreasonable to insist on the variables in $\vec Z$ remaining
constant; AC2(b^o) and AC2(b^u) require merely that changes in $\vec Z$ not affect $\varphi$. ∎
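
A small Python sketch, offered only as an illustration, makes the failure of (2.1) in this example explicit. It encodes $T = V_1 + V_2$ and P = 1 iff $T \ge 1$; the helper name `solve` and the dictionary encoding are assumptions of the sketch.

```python
def solve(do=None, context=(1, 1)):
    """Solve T = V1 + V2 and P = [T >= 1] under the intervention `do`."""
    do = do or {}
    v = {"V1": do.get("V1", context[0]), "V2": do.get("V2", context[1])}
    v["T"] = do.get("T", v["V1"] + v["V2"])
    v["P"] = do.get("P", int(v["T"] >= 1))
    return v

actual = solve()                                   # V1 = V2 = 1, T = 2, P = 1
# AC2(a) for V1 = 1 under the contingency V2 = 0.
ac2a = solve({"V1": 0, "V2": 0})["P"] == 0
# AC2(b): P = 1 under V1 = 1, V2 = 0, with or without T frozen at 2.
ac2b = (solve({"V1": 1, "V2": 0})["P"] == 1
        and solve({"V1": 1, "V2": 0, "T": actual["T"]})["P"] == 1)
# Condition (2.1) would need T to keep its original value; it does not.
t_under_contingency = solve({"V1": 1, "V2": 0})["T"]
condition_2_1_holds = t_under_contingency == actual["T"]
```

Here `ac2a` and `ac2b` are true while `condition_2_1_holds` is false, since T drops from 2 to 1 under the contingency even though P is unaffected.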


2.9 Causal Paths 


There is an intuition that causality travels along a path: A causes B, which causes C, which
causes D, so there is a path from A to B to C to D. And, indeed, many accounts of causality
explicitly make use of causal paths. In the original and updated HP definition, AC2 involves
a partition of the endogenous variables into two sets, $\vec Z$ and $\vec W$. In principle, there are no
constraints on what partition to use, other than the requirement that $\vec X$ be a subset of $\vec Z$.
However, when introducing AC2(a), I suggested that $\vec Z$ can be thought of as consisting of the
variables on the causal path from $\vec X$ to $\varphi$. (Recall that the notion of a causal path is defined
formally in Section 2.4.) As I said, this intuition is not quite true for the updated HP definition;
however, it is true for the original HP definition. In this section, I examine the role of causal
paths in AC2. In showing that $\vec X = \vec x$ is a cause of $\varphi$, I would like to show that we can take $\vec Z$
(or $\mathcal V - \vec W$ in the case of the modified HP definition) to consist only of variables that lie on
a causal path from some variable in $\vec X$ to some variable in $\varphi$. The following example shows
that this is not the case in general with the updated HP definition.


Example 2.9.1 Consider the following variant of the disjunctive forest-fire scenario. Now
there is a second arsonist; he will drop a lit match exactly if the first one does. Thus, we add
a new variable MD′ with the equation MD′ = MD. Moreover, FF is no longer a binary
variable; it has three possible values: 0, 1, and 2. We have that FF = 0 if L = MD =
MD′ = 0; FF = 1 if either L = 1 and MD = MD′ or MD = MD′ = 1; and FF = 2
if $MD \ne MD'$. We can suppose that the forest burns asymmetrically (FF = 2) if exactly
one of the arsonists drops a lit match (which does not happen under normal circumstances);
one side of the forest is more damaged than the other (even if the lightning strikes). Call the
resulting causal model $M'_{FF}$. Figure 2.11 shows the causal network corresponding to $M'_{FF}$.

Figure 2.11: The forest fire with two arsonists.


In the context u where L = MD = 1, L = 1 is a cause of FF = 1 according to the updated
HP definition. Note that there is only one causal path from L to FF: the path consisting of
just the variables L and FF. But to show that L = 1 is a cause of FF = 1 according to the
updated HP definition, we need to take the witness to be ({MD}, 0, 0); in particular, MD′ is
part of $\vec Z$, not $\vec W$.

To see that this witness works, note that $(M'_{FF}, u) \models [L \gets 0, MD \gets 0](FF = 0)$,
so AC2(a) holds. Since we have both $(M'_{FF}, u) \models [L \gets 1, MD \gets 0](FF = 1)$ and
$(M'_{FF}, u) \models [L \gets 1](FF = 1)$, AC2(b^u) also holds. However, suppose that we wanted
to find a witness where the set $\vec W = \{MD, MD'\}$ (so that $\vec Z$ can be the unique causal path
from L to FF). What do we set MD and MD′ to in the witness? At least one of them must
be 0, so that when L is set to 0, $FF \ne 1$. Setting MD′ = 0 does not work: $(M'_{FF}, u) \models
[L \gets 1, MD' \gets 0](FF = 2)$, so AC2(b^u) does not hold. And setting MD = 0 and MD′ = 1
does not work either, for then FF = 2, independent of the value of L, so again AC2(b^u) does
not hold.

By way of contrast, we can take ({MD, MD′}, 0, 0) to be a witness to L = 1 being
a cause according to the original HP definition; AC2(b^o) holds with this witness because
$(M'_{FF}, u) \models [L \gets 1, MD \gets 0, MD' \gets 0](FF = 1)$.

Finally, perhaps disconcertingly, L = 1 is not even part of a cause of FF = 1 in $(M'_{FF}, u)$
according to the modified HP definition. That is because MD = 1 is now a cause; setting MD
to 0 while keeping MD′ fixed at 1 results in FF being 2. Thus, unlike the original forest-fire
example, $L = 1 \land MD = 1$ is no longer a cause of FF = 1 according to the modified HP
definition. This problem disappears once we take normality into account (see Example 3.2.5). ∎
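
The individual intervention formulas quoted in Example 2.9.1 can be replayed with the following Python sketch. It is an illustration only: the helper names `ff_value` and `solve` and the key `MDp` (standing for MD′) are assumptions of the sketch, and it checks only the formulas named in the text rather than searching over all witnesses.

```python
def ff_value(l, md, mdp):
    """Three-valued fire: 2 if exactly one arsonist drops a match,
    otherwise 1 if there is lightning or a dropped match, else 0."""
    if md != mdp:
        return 2
    return int(l or md)

def solve(do=None, context=(1, 1)):
    """Solve the two-arsonist model in the context where L = MD = 1."""
    do = do or {}
    v = {"L": do.get("L", context[0]), "MD": do.get("MD", context[1])}
    v["MDp"] = do.get("MDp", v["MD"])          # MD' = MD
    v["FF"] = do.get("FF", ff_value(v["L"], v["MD"], v["MDp"]))
    return v

def ff(do):
    return solve(do)["FF"]

# Formulas cited for the witness ({MD}, 0, 0).
cited_a = ff({"L": 0, "MD": 0}) == 0
cited_b = ff({"L": 1, "MD": 0}) == 1 and ff({"L": 1}) == 1
# Attempts with W = {MD, MD'} that the text rules out.
mdp_zero = ff({"L": 1, "MDp": 0}) == 2
mixed = all(ff({"L": l, "MD": 0, "MDp": 1}) == 2 for l in (0, 1))
# Formula cited for the original definition with W = {MD, MD'}.
original_b = ff({"L": 1, "MD": 0, "MDp": 0}) == 1
# Modified definition: MD = 0 with MD' held at its actual value 1.
md_but_for = ff({"MD": 0, "MDp": 1}) == 2
```

All of these flags are true under the equations of the example, so each displayed formula is reproduced by direct evaluation.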


For the original HP definition, we can take the set $\vec Z$ in AC2, not just in this example, but in
general, to consist only of variables that lie on a causal path from a variable in $\vec X$ to a variable
in $\varphi$. Interestingly, it is also true of the modified HP definition, except in that case we have
to talk about $\mathcal V - \vec W$ rather than $\vec Z$ because $\mathcal V - \vec W$ is the analogue of $\vec Z$ in the modified HP
definition.


Proposition 2.9.2 If $\vec X = \vec x$ is a cause of $\varphi$ in $(M, \vec u)$ according to the original or modified
HP definition, then there exists a witness $(\vec W, \vec w, \vec x')$ to this such that every variable
$Z \in \mathcal V - \vec W$ lies on a causal path in $(M, \vec u)$ from some variable in $\vec X$ to some variable in $\varphi$.


For definiteness, I take X to be a variable “in $\varphi$” if, syntactically, X appears in $\varphi$, even if
$\varphi$ is equivalent to a formula that does not involve X. For example, if Y is a binary variable,
I take Y to appear in $X = 1 \land (Y = 1 \lor Y \ne 1)$, although this formula is equivalent
to X = 1. (The proposition is actually true even if I assume that Y does not appear in
$X = 1 \land (Y = 1 \lor Y \ne 1)$.) The proof of the proposition can be found in Section 2.10.3.


2.10 Proofs 


In this section, I prove some of the results stated earlier. This section can be skipped without 
loss of continuity. 


2.10.1 Proof of Theorem 2.2.3 


I repeat the statements of the theorem here (and in the later sections) for the reader’s conve- 
nience. 


Theorem 2.2.3: 


(a) If $X = x$ is part of a cause of $\varphi$ in $(M, \vec u)$ according to the modified HP definition, then
$X = x$ is a cause of $\varphi$ in $(M, \vec u)$ according to the original HP definition.


(b) If $X = x$ is part of a cause of $\varphi$ in $(M, \vec u)$ according to the modified HP definition, then
$X = x$ is a cause of $\varphi$ in $(M, \vec u)$ according to the updated HP definition.


(c) If $X = x$ is part of a cause of $\varphi$ in $(M, \vec u)$ according to the updated HP definition, then
$X = x$ is a cause of $\varphi$ in $(M, \vec u)$ according to the original HP definition.


(d) If $\vec X = \vec x$ is a cause of $\varphi$ in $(M, \vec u)$ according to the original HP definition, then $|\vec X| = 1$
(i.e., $\vec X$ is a singleton).


Proof: For part (a), suppose that $X = x$ is part of a cause of $\varphi$ in $(M, \vec u)$ according to the
modified HP definition, so that there is a cause $\vec X = \vec x$ such that $X = x$ is one of its conjuncts.
I claim that $X = x$ is a cause of $\varphi$ according to the original HP definition. By definition, there
must exist a value $\vec x' \in \mathcal R(\vec X)$ and a set $\vec W \subseteq \mathcal V - \vec X$ such that if $(M, \vec u) \models \vec W = \vec w^*$, then
$(M, \vec u) \models [\vec X \gets \vec x', \vec W \gets \vec w^*]\neg\varphi$. Moreover, $\vec X$ is minimal.

To show that $X = x$ is a cause according to the original HP definition, we must find an
appropriate witness. If $\vec X = \{X\}$, then it is immediate that $(\vec W, \vec w^*, x')$ is a witness. If $|\vec X| > 1$,
suppose without loss of generality that $\vec X = (X_1, \ldots, X_n)$, and $X = X_1$. In general, if $\vec Y$ is
a vector, I write $\vec Y_{-1}$ to denote all components of the vector except the first one, so that
$\vec X_{-1} = (X_2, \ldots, X_n)$. I want to show that $X_1 = x_1$ is a cause of $\varphi$ in $(M, \vec u)$ according to the original
HP definition. Clearly, $(M, \vec u) \models (X_1 = x_1) \land \varphi$, since $\vec X = \vec x$ is a cause of $\varphi$ in $(M, \vec u)$
according to the modified HP definition, so AC1 holds. The obvious candidate for a witness
for AC2(a) is $(\vec X_{-1} \cdot \vec W, \vec x'_{-1} \cdot \vec w^*, x'_1)$, where $\cdot$ is the operator that concatenates two vectors
so that, for example, $(1, 3) \cdot (2) = (1, 3, 2)$; that is, roughly speaking, we move $X_2, \ldots, X_n$
into $\vec W$. This satisfies AC2(a), since $(M, \vec u) \models [X_1 \gets x'_1, \vec X_{-1} \gets \vec x'_{-1}, \vec W \gets \vec w^*]\neg\varphi$ by
assumption. AC3 trivially holds for $X_1 = x_1$, so it remains to deal with AC2(b^o). Suppose,
by way of contradiction, that $(M, \vec u) \models [X_1 \gets x_1, \vec X_{-1} \gets \vec x'_{-1}, \vec W \gets \vec w^*]\neg\varphi$. This means
that $\vec X_{-1} = \vec x_{-1}$ satisfies AC2(a^m), showing that AC3 (more precisely, the version of AC3
appropriate for the modified HP definition) is violated (taking $((X_1) \cdot \vec W, (x_1) \cdot \vec w^*, \vec x'_{-1})$ as the
witness), and $\vec X = \vec x$ is not a cause of $\varphi$ in $(M, \vec u)$ according to the modified HP definition, a
contradiction. Thus, $(M, \vec u) \models [X_1 \gets x_1, \vec X_{-1} \gets \vec x'_{-1}, \vec W \gets \vec w^*]\varphi$.

This does not yet show that AC2(b^o) holds: there might be some subset $\vec Z'$ of variables
in $\mathcal V - (\vec X_{-1} \cup \vec W)$ that change value when $\vec W$ is set to $\vec w^*$ and $\vec X_{-1}$ is set to $\vec x'_{-1}$, and
when these variables are set to their original values in $(M, \vec u)$, $\varphi$ does not hold, thus
violating AC2(b^o). In more detail, suppose that there exist a set $\vec Z' = (Z_1, \ldots, Z_k) \subseteq \vec Z$
and values $z^*_j$ for each variable $Z_j \in \vec Z'$ such that (i) $(M, \vec u) \models Z_j = z^*_j$ and (ii)
$(M, \vec u) \models [X_1 \gets x_1, \vec X_{-1} \gets \vec x'_{-1}, \vec W \gets \vec w^*, \vec Z' \gets \vec z^*]\neg\varphi$. But then $\vec X = \vec x$ is not a cause
of $\varphi$ in $(M, \vec u)$ according to the modified HP definition. Condition (ii) shows that AC2(a^m) is
satisfied for $\vec X_{-1}$, taking $((X_1) \cdot \vec W \cdot \vec Z', (x_1) \cdot \vec w^* \cdot \vec z^*, \vec x'_{-1})$ as the witness, so again AC3 is
violated. It follows that AC2(b^o) holds, completing the proof.

The proof of part (b) is similar in spirit. Indeed, we just need to show one more thing. For
AC2(b^u), we must show that if $\vec X' \subseteq \vec X_{-1}$, $\vec W' \subseteq \vec W$, and $\vec Z' \subseteq \vec Z$, then


$(M, \vec u) \models [X_1 \gets x_1, \vec X' \gets \vec x'_{-1}, \vec W' \gets \vec w^*, \vec Z' \gets \vec z^*]\varphi$.    (2.2)


(Here I am using the abuse of notation that I referred to in Section 2.2.2, that if $\vec X' \subseteq \vec X$
and $\vec x \in \mathcal R(\vec X)$, I write $\vec X' \gets \vec x$, with the intention that the components of $\vec x$ not included in
$\vec X'$ are ignored.) It follows easily from AC1 that (2.2) holds if $\vec X' = \emptyset$. And if (2.2) does not
hold for some strict nonempty subset $\vec X'$ of $\vec X_{-1}$, then $\vec X = \vec x$ is not a cause of $\varphi$ according
to the modified HP definition because AC3 does not hold; AC2(a^m) is satisfied for $\vec X'$.

For part (c), the proof is again similar in spirit. Indeed, the proof is identical up to
the point where we must show that AC2(b^o) holds. Now if $\vec Z' \subseteq \vec Z$ and $(M, \vec u) \models
[X_1 \gets x_1, \vec X_{-1} \gets \vec x'_{-1}, \vec W \gets \vec w, \vec Z' \gets \vec z^*]\neg\varphi$, then $\vec X_{-1} = \vec x_{-1}$ satisfies AC2(a) (taking
$((X_1) \cdot \vec W \cdot \vec Z', (x_1) \cdot \vec w \cdot \vec z^*, \vec x'_{-1})$ as the witness). It also satisfies AC2(b^u), since $\vec X = \vec x$
satisfies AC2(b^u), by assumption. Thus, the version of AC3 appropriate for the updated HP
definition is violated.

Finally, for part (d), the proof is yet again similar in spirit. Suppose that $\vec X = \vec x$ is a cause
of $\varphi$ in $(M, \vec u)$ according to the original HP definition and $|\vec X| > 1$. Let $X = x$ be a conjunct
of $\vec X = \vec x$. Again, we can show that $X = x$ is a cause of $\varphi$ in $(M, \vec u)$ according to the original
HP definition, which is exactly what is needed to prove part (d), using essentially the same
argument as above. I leave the details to the reader. ∎


2.10.2 Proof of Proposition 2.4.6 


In this section, I prove Proposition 2.4.6. I start with a simple lemma that states a key (and 
obvious!) property of causal paths: if there is no causal path from X to Y, then changing the 
value of X cannot change the value of Y. This fact is made precise in the following lemma. 
Although it is intuitively obvious, proving it carefully requires a little bit of work. I translate 
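the proof into statisticians’ notation as well.

Lemma 2.10.1 If Y and all the variables in $\vec X$ are endogenous, $Y \notin \vec X$, and there is no
causal path in $(M, \vec u)$ from a variable in $\vec X$ to Y, then for all sets $\vec W$ of variables disjoint
from $\vec X$ and Y, and all settings $\vec x$ and $\vec x'$ for $\vec X$, y for Y, and $\vec w$ for $\vec W$, we have


$(M, \vec u) \models [\vec X \gets \vec x, \vec W \gets \vec w](Y = y)$ iff $(M, \vec u) \models [\vec X \gets \vec x', \vec W \gets \vec w](Y = y)$


and


$(M, \vec u) \models [\vec X \gets \vec x, \vec W \gets \vec w](Y = y)$ iff $(M, \vec u) \models [\vec W \gets \vec w](Y = y)$

(i.e., $Y_{\vec x \vec w}(\vec u) = y$ iff $Y_{\vec x' \vec w}(\vec u) = y$ and $Y_{\vec x \vec w}(\vec u) = y$ iff $Y_{\vec w}(\vec u) = y$).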


Proof: Define the maximum distance of a variable Y in a causal setting $(M, \vec u)$, denoted
maxdist(Y), to be the length of the longest causal path in $(M, \vec u)$ from an exogenous variable
to Y. We prove both parts of the result simultaneously by induction on maxdist(Y). If
maxdist(Y) = 1, then the value of Y depends only on the values of the exogenous variables,
so the result trivially holds. If maxdist(Y) > 1, let $Z_1, \ldots, Z_k$ be the endogenous variables
on which Y depends. These are the parents of Y in the causal network (i.e., these are exactly
the endogenous variables Z such that there is an edge from Z to Y in the causal network). For
each $Z \in \{Z_1, \ldots, Z_k\}$, maxdist(Z) < maxdist(Y): for each causal path in $(M, \vec u)$ from
an exogenous variable to Z, there is a longer path to Y, namely, the one formed by adding
the edge from Z to Y. Moreover, there is no causal path in $(M, \vec u)$ from a variable in $\vec X$ to
any of $Z_1, \ldots, Z_k$, nor is any of $Z_1, \ldots, Z_k$ in $\vec X$ (for otherwise there would be a causal path
in $(M, \vec u)$ from a variable in $\vec X$ to Y, contradicting the assumption of the lemma). Thus, the
inductive hypothesis holds for each of $Z_1, \ldots, Z_k$. Since the value of each of $Z_1, \ldots, Z_k$ does
not change when we change the setting of $\vec X$ from $\vec x$ to $\vec x'$, and the value of Y depends only
on the values of $Z_1, \ldots, Z_k$ and $\vec u$ (i.e., the values of the exogenous variables), the value of Y
cannot change either. ∎
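
As a sanity check on the lemma (not a substitute for the proof), the following Python sketch exhaustively tests its second equivalence on one small recursive model. The model itself, with variables A through E and exogenous variables UA and UC, is invented purely for illustration, as are the helper names.

```python
from itertools import product

# Hypothetical recursive model: A = UA, B = A, C = UC, D = B or C, E = 1 - C.
PARENTS = {"A": [], "B": ["A"], "C": [], "D": ["B", "C"], "E": ["C"]}
EQS = {
    "A": lambda v, u: u["UA"],
    "B": lambda v, u: v["A"],
    "C": lambda v, u: u["UC"],
    "D": lambda v, u: v["B"] or v["C"],
    "E": lambda v, u: 1 - v["C"],
}
ORDER = ["A", "B", "C", "D", "E"]

def has_path(src, dst):
    """True if there is a (possibly empty) causal path from src to dst."""
    return src == dst or any(has_path(src, p) for p in PARENTS[dst])

def solve(u, do):
    """Solve the model in context u under the intervention `do`."""
    v = {}
    for n in ORDER:
        v[n] = do[n] if n in do else int(EQS[n](v, u))
    return v

def lemma_holds(x="A"):
    """Check that intervening on x never changes a variable y that has
    no causal path from x, for every context and every setting of W."""
    for ua, uc in product((0, 1), repeat=2):
        u = {"UA": ua, "UC": uc}
        for y in ORDER:
            if has_path(x, y):
                continue
            rest = [n for n in ORDER if n not in (x, y)]
            # Each variable in `rest` is either left alone (None) or set.
            for mask in product((None, 0, 1), repeat=len(rest)):
                w = {n: val for n, val in zip(rest, mask) if val is not None}
                base = solve(u, w)[y]
                if any(solve(u, {**w, x: val})[y] != base for val in (0, 1)):
                    return False
    return True
```

In this toy model C and E have no causal path from A, and `lemma_holds()` confirms that no intervention on A changes either of them, whatever else is held fixed. The first equivalence of the lemma follows from the second, which is why only the second is tested.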


I can now prove Proposition 2.4.6. I restate it for the reader’s convenience. 
Proposition 2.4.6: Suppose that $X_1 = x_1$ is a but-for cause of $X_2 = x_2$ in the causal setting
$(M, \vec u)$, $X_2 = x_2$ is a but-for cause of $X_3 = x_3$ in $(M, \vec u)$, and the following two conditions
hold:


(a) for every value $x'_2 \in \mathcal R(X_2)$, there exists a value $x'_1 \in \mathcal R(X_1)$ such that $(M, \vec u) \models
[X_1 \gets x'_1](X_2 = x'_2)$ (i.e., $(X_2)_{x'_1}(\vec u) = x'_2$);

(b) $X_2$ is on every causal path in $(M, \vec u)$ from $X_1$ to $X_3$.

Then $X_1 = x_1$ is a but-for cause of $X_3 = x_3$.
Proof: Since $X_2 = x_2$ is a but-for cause of $X_3 = x_3$ in $(M, \vec u)$, there exists a value
$x'_2 \ne x_2$ such that $(M, \vec u) \models [X_2 \gets x'_2](X_3 \ne x_3)$ (i.e., $(X_3)_{x'_2}(\vec u) \ne x_3$). Choose $x'_3$ such
that $(M, \vec u) \models [X_2 \gets x'_2](X_3 = x'_3)$ (i.e., $(X_3)_{x'_2}(\vec u) = x'_3$). By assumption, there exists
a value $x'_1$ such that $(M, \vec u) \models [X_1 \gets x'_1](X_2 = x'_2)$ (i.e., $(X_2)_{x'_1}(\vec u) = x'_2$). I claim that
$(M, \vec u) \models [X_1 \gets x'_1](X_3 = x'_3)$ (i.e., $(X_3)_{x'_1}(\vec u) = x'_3$). This follows from a more general
claim. I show that if Y is on a causal path from $X_2$ to $X_3$, then


$(M, \vec u) \models [X_1 \gets x'_1](Y = y)$ iff $(M, \vec u) \models [X_2 \gets x'_2](Y = y)$    (2.3)

(i.e., $Y_{x'_1}(\vec u) = y$ iff $Y_{x'_2}(\vec u) = y$).

Define a partial order $\preceq$ on endogenous variables that lie on a causal path from $X_2$ to $X_3$
by taking $Y_1 \prec Y_2$ if $Y_1$ precedes $Y_2$ on some causal path from $X_2$ to $X_3$. Since M is a
recursive model, if $Y_1$ and $Y_2$ are distinct variables and $Y_1 \prec Y_2$, we cannot have $Y_2 \prec Y_1$
(otherwise there would be a cycle). I prove (2.3) by induction on the $\prec$ ordering. The least
element in this ordering is clearly $X_2$; $X_2$ must come before every other variable on a causal
path from $X_2$ to $X_3$. $(M, \vec u) \models [X_1 \gets x'_1](X_2 = x'_2)$ (i.e., $(X_2)_{x'_1}(\vec u) = x'_2$) by assumption,
and clearly $(M, \vec u) \models [X_2 \gets x'_2](X_2 = x'_2)$ (i.e., $(X_2)_{x'_2}(\vec u) = x'_2$). Thus, (2.3) holds for
$X_2$. This completes the base case of the induction.


For the inductive step, let Y be a variable that lies on a causal path in $(M, \vec u)$ from $X_2$ to
$X_3$, and suppose that (2.3) holds for all variables Y′ such that $Y' \prec Y$. Let $Z_1, \ldots, Z_k$ be the
endogenous variables that Y depends on in M. For each of these variables $Z_i$, either there is a
causal path in $(M, \vec u)$ from $X_1$ to $Z_i$ or there is not. If there is, then the path from $X_1$ to $Z_i$ can
be extended to a directed path P from $X_1$ to $X_3$ by going from $X_1$ to $Z_i$, from $Z_i$ to Y, and
from Y to $X_3$ (since Y lies on a causal path in $(M, \vec u)$ from $X_2$ to $X_3$). Since, by assumption,
$X_2$ lies on every causal path in $(M, \vec u)$ from $X_1$ to $X_3$, $X_2$ must lie on P. Moreover, $X_2$ must
precede Y on P. (Proof: Since Y lies on a path P′ from $X_2$ to $X_3$, $X_2$ must precede Y on P′.
If Y precedes $X_2$ on P, then there is a cycle, which is a contradiction.) Since $Z_i$ precedes Y
on P, it follows that $Z_i \prec Y$, so by the inductive hypothesis, $(M, \vec u) \models [X_1 \gets x'_1](Z_i = z_i)$
iff $(M, \vec u) \models [X_2 \gets x'_2](Z_i = z_i)$.

Now if there is no causal path in $(M, \vec u)$ from $X_1$ to $Z_i$, then there also cannot be a causal
path P from $X_2$ to $Z_i$ (otherwise there would be a causal path in $(M, \vec u)$ from $X_1$ to $Z_i$ formed
by appending P to a causal path from $X_1$ to $X_2$, which must exist because, if not, it easily
follows from Lemma 2.10.1 that $X_1 = x_1$ would not be a cause of $X_2 = x_2$). Since there is
no causal path in $(M, \vec u)$ from $X_1$ to $Z_i$, we must have that $(M, \vec u) \models [X_1 \gets x'_1](Z_i = z_i)$ iff
$(M, \vec u) \models [X_2 \gets x'_2](Z_i = z_i)$ (by Lemma 2.10.1, neither intervention changes the value of $Z_i$).

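Since the value of Y depends only on the values of $Z_1, \ldots, Z_k$ and $\vec u$, and I have
just shown that $(M, \vec u) \models [X_1 \gets x'_1](Z_1 = z_1 \land \ldots \land Z_k = z_k)$ iff $(M, \vec u) \models
[X_2 \gets x'_2](Z_1 = z_1 \land \ldots \land Z_k = z_k)$ (i.e., $(Z_1)_{x'_1}(\vec u) = z_1 \land \ldots \land (Z_k)_{x'_1}(\vec u) = z_k$ iff
$(Z_1)_{x'_2}(\vec u) = z_1 \land \ldots \land (Z_k)_{x'_2}(\vec u) = z_k$), it follows that $(M, \vec u) \models [X_1 \gets x'_1](Y = y)$ iff
$(M, \vec u) \models [X_2 \gets x'_2](Y = y)$ (i.e., $Y_{x'_1}(\vec u) = y$ iff $Y_{x'_2}(\vec u) = y$).
This completes the proof of the induction step. Since $X_3$ is on a causal path in
$(M, \vec u)$ from $X_2$ to $X_3$, it now follows that $(M, \vec u) \models [X_1 \gets x'_1](X_3 = x'_3)$ iff
$(M, \vec u) \models [X_2 \gets x'_2](X_3 = x'_3)$ (i.e., $(X_3)_{x'_1}(\vec u) = x'_3$ iff $(X_3)_{x'_2}(\vec u) = x'_3$). Since
$(M, \vec u) \models [X_2 \gets x'_2](X_3 = x'_3)$ (i.e., $(X_3)_{x'_2}(\vec u) = x'_3$) by construction, we have that
$(M, \vec u) \models [X_1 \gets x'_1](X_3 = x'_3)$ (i.e., $(X_3)_{x'_1}(\vec u) = x'_3$), as desired. Thus, $X_1 = x_1$ is a
but-for cause of $X_3 = x_3$. ∎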


2.10.3 Proof of Proposition 2.9.2 


In this section, I prove Proposition 2.9.2.

Proposition 2.9.2: If $\vec X = \vec x$ is a cause of $\varphi$ in $(M, \vec u)$ according to the original or modified HP
definition, then there exists a witness $(\vec W, \vec w, \vec x')$ to this such that every variable $Z \in \mathcal V - \vec W$
lies on a causal path in $(M, \vec u)$ from some variable in $\vec X$ to some variable in $\varphi$.


To prove the result, I need to prove a general property of causal models, one sufficiently 
important to be an axiom in the axiomatization of causal reasoning given in Section 5.4. 


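Lemma 2.10.2. If $(M,\vec{u}) \models [\vec{X} \gets \vec{x}](\vec{Y} = \vec{y})$, then $(M,\vec{u}) \models [\vec{X} \gets \vec{x}]\varphi$ if and only if
$(M,\vec{u}) \models [\vec{X} \gets \vec{x}, \vec{Y} \gets \vec{y}]\varphi$.


Proof: In the unique solution to the equations when $\vec{X}$ is set to $\vec{x}$, $\vec{Y} = \vec{y}$. So the unique
solution to the equations when $\vec{X}$ is set to $\vec{x}$ is the same as the unique solution to the equations
when $\vec{X}$ is set to $\vec{x}$ and $\vec{Y}$ is set to $\vec{y}$. The result now follows immediately. ∎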
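
Lemma 2.10.2 can be checked mechanically on a small example. The following minimal Python sketch is not part of the original text: the toy model (variables `A`, `B`, `C` and their equations), the helper `solve`, and the formula `phi` are all assumptions made purely for illustration.

```python
# Toy recursive causal model (hypothetical, for illustration only).
# Each endogenous variable has an equation that reads earlier variables.
EQUATIONS = {
    "A": lambda v, u: u,                 # A copies the context
    "B": lambda v, u: v["A"],            # B copies A
    "C": lambda v, u: v["A"] or v["B"],  # C is true if A or B is
}
ORDER = ["A", "B", "C"]  # an order compatible with the equations


def solve(u, interventions=None):
    """Return the unique solution in context u under the given interventions."""
    interventions = interventions or {}
    values = {}
    for var in ORDER:
        if var in interventions:
            values[var] = interventions[var]  # the equation for var is replaced
        else:
            values[var] = EQUATIONS[var](values, u)
    return values


def phi(values):
    """An arbitrary formula over the endogenous variables: here, 'C is true'."""
    return values["C"]


u = True
after_x = solve(u, {"A": False})           # the solution under [A <- False]
y = after_x["B"]                           # so [A <- False](B = y) holds
after_xy = solve(u, {"A": False, "B": y})  # the solution under [A <- False, B <- y]
same_solution = after_x == after_xy
same_truth_value = phi(after_x) == phi(after_xy)
```

In this sketch the two solutions coincide, so any formula has the same truth value in both, which is exactly the argument in the proof. Setting `B` to a value other than `y` would, in general, change the solution.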


With this background, I can prove Proposition 2.9.2. 


Proof of Proposition 2.9.2: Suppose that $\vec{X} = \vec{x}$ is a cause of $\varphi$ in $(M,\vec{u})$ with witness
$(\vec{W}, \vec{w}, \vec{x}')$ according to the original (resp., modified) HP definition. Let $\vec{Z} = \mathcal{V} - \vec{W}$, let
$\vec{Z}'$ consist of the variables in $\vec{Z}$ that lie on a causal path in $(M,\vec{u})$ from some variable in
$\vec{X}$ to some variable in $\varphi$, and let $\vec{W}' = \mathcal{V} - \vec{Z}'$. Notice that $\vec{W}'$ is a superset of $\vec{W}$. Let
$\vec{W}' - \vec{W} = \vec{Y} = \{Y_1, \ldots, Y_k\}$. The proof diverges slightly now depending on whether we
consider the original or modified HP definition. In the case of the original HP definition, let
$y_j$ be the value of $Y_j$ such that $(M,\vec{u}) \models [\vec{X} \gets \vec{x}', \vec{W} \gets \vec{w}](Y_j = y_j)$; in the case of the
modified HP definition, let $y_j$ be the value of $Y_j$ such that $(M,\vec{u}) \models Y_j = y_j$. I claim that
$(\vec{W} \cdot \vec{Y}, \vec{w} \cdot \vec{y}, \vec{x}')$ is a witness to $\vec{X} = \vec{x}$ being a cause of $\varphi$ in $(M,\vec{u})$ according to the original
(resp., modified) HP definition. (Recall from the proof of Theorem 2.2.3 that $\vec{x} \cdot \vec{y}$ denotes
the result of concatenating the vectors $\vec{x}$ and $\vec{y}$.) Once I prove this, I will have proved the
proposition.
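
The construction of $\vec{Z}'$, $\vec{W}'$, and $\vec{Y}$ at the start of this proof amounts to a reachability computation. The sketch below is an illustration under simplifying assumptions, not the author's procedure: it reads causal paths off a fixed directed graph (in the text, causal paths are relative to the causal setting $(M,\vec{u})$), and the variable names and edges are hypothetical.

```python
from collections import defaultdict


def reachable(graph, sources):
    """Return all nodes reachable from `sources` by following edges in `graph`."""
    seen, stack = set(), list(sources)
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def on_causal_path(edges, cause_vars, effect_vars):
    """Variables lying on some directed path from cause_vars to effect_vars."""
    forward, backward = defaultdict(list), defaultdict(list)
    for parent, child in edges:
        forward[parent].append(child)
        backward[child].append(parent)
    downstream = reachable(forward, cause_vars) | set(cause_vars)
    upstream = reachable(backward, effect_vars) | set(effect_vars)
    return downstream & upstream


# Hypothetical network: X is the candidate cause, E is the variable in phi.
EDGES = [("X", "Z1"), ("Z1", "E"), ("X", "Z2"), ("Q", "E"), ("W", "Z1")]
V = {"X", "Z1", "Z2", "Q", "W", "E"}
W_old = {"W"}                                         # variables in the given witness
Z_old = V - W_old                                     # Z = V - W
Z_new = Z_old & on_causal_path(EDGES, {"X"}, {"E"})   # Z'
W_new = V - Z_new                                     # W' = V - Z'
Y = W_new - W_old                                     # the variables moved into the witness
```

Here `Y` contains `Z2` (reachable from `X` but with no path to `E`) and `Q` (with a path to `E` but not reachable from `X`), which are the two cases distinguished in the proof.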

Since $\vec{X} = \vec{x}$ is a cause of $\varphi$ by assumption, AC1 and AC3 hold; they are independent of
the witness. It suffices to prove that AC2(a) and AC2(b$^o$) (resp., AC2(a$^m$)) hold for this
witness.

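Since $(\vec{W}, \vec{w}, \vec{x}')$ is a witness to $\vec{X} = \vec{x}$ being a cause of $\varphi$ in $(M,\vec{u})$ according to the
original (resp., modified) HP definition, by AC2(a) (resp., AC2(a$^m$)), it follows that

$$(M,\vec{u}) \models [\vec{X} \gets \vec{x}', \vec{W} \gets \vec{w}]\neg\varphi. \tag{2.4}$$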


Since no variable in $\vec{Y}$ is on a causal path in $(M,\vec{u})$ from a variable in $\vec{X}$ to a variable in
$\varphi$, for each $Y \in \vec{Y}$, either there is no causal path in $(M,\vec{u})$ from a variable in $\vec{X}$ to $Y$ or there
is no causal path in $(M,\vec{u})$ from $Y$ to a variable in $\varphi$ (or both). Without loss of generality, we
can assume that the variables in $\vec{Y}$ are ordered so that there is no causal path in $(M,\vec{u})$ from a
variable in $\vec{X}$ to any of the first $j$ variables, $Y_1, \ldots, Y_j$, and there is no causal path in $(M,\vec{u})$
from any of the last $k - j$ variables, $Y_{j+1}, \ldots, Y_k$, to a variable in $\varphi$.

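I first do the argument in the case of the modified HP definition. In this case, $\vec{w}$ must
be the value of the variables in $\vec{W}$ in context $\vec{u}$; by definition, $\vec{y}$ is the value of the
variables in $\vec{Y}$. Thus, $\vec{W} \cdot \vec{Y}$ is a legitimate witness for AC2(a$^m$). Moreover, since
$(M,\vec{u}) \models (\vec{W} = \vec{w}) \wedge (\vec{Y} = \vec{y})$, by Lemma 2.10.2, we have

$$(M,\vec{u}) \models [\vec{W} \gets \vec{w}](Y_1 = y_1 \wedge \ldots \wedge Y_j = y_j).$$

Since there is no causal path in $(M,\vec{u})$ from $\vec{X}$ to any of $Y_1, \ldots, Y_j$, by Lemma 2.10.1, it
follows that

$$(M,\vec{u}) \models [\vec{X} \gets \vec{x}', \vec{W} \gets \vec{w}](Y_1 = y_1 \wedge \ldots \wedge Y_j = y_j). \tag{2.5}$$

Thus, by (2.4), (2.5), and Lemma 2.10.2, it follows that

$$(M,\vec{u}) \models [\vec{X} \gets \vec{x}', \vec{W} \gets \vec{w}, Y_1 \gets y_1, \ldots, Y_j \gets y_j]\neg\varphi. \tag{2.6}$$

Finally, since there is no causal path in $(M,\vec{u})$ from any of $Y_{j+1}, \ldots, Y_k$ to any variable in $\varphi$,
it follows from (2.6) and Lemma 2.10.1 that

$$(M,\vec{u}) \models [\vec{X} \gets \vec{x}', \vec{W} \gets \vec{w}, Y_1 \gets y_1, \ldots, Y_k \gets y_k]\neg\varphi,$$

so AC2(a$^m$) holds, and $\vec{X} = \vec{x}$ is a cause of $\varphi$ in $(M,\vec{u})$ according to the modified HP
definition with witness $(\vec{W} \cdot \vec{Y}, \vec{w} \cdot \vec{y}, \vec{x}')$.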

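We have to work a little harder in the case of the original HP definition (although the basic
ideas are similar) because we can no longer assume that $(M,\vec{u}) \models \vec{W} = \vec{w}$. By the definition
of $\vec{y}$ in this case, we have $(M,\vec{u}) \models [\vec{X} \gets \vec{x}', \vec{W} \gets \vec{w}](\vec{Y} = \vec{y})$. Thus, by (2.4) and Lemma 2.10.2,

$$(M,\vec{u}) \models [\vec{X} \gets \vec{x}', \vec{W} \gets \vec{w}, \vec{Y} \gets \vec{y}]\neg\varphi.$$

This shows that AC2(a) holds for the witness $(\vec{W} \cdot \vec{Y}, \vec{w} \cdot \vec{y}, \vec{x}')$.

For AC2(b$^o$), we must show that $(M,\vec{u}) \models [\vec{X} \gets \vec{x}, \vec{W} \gets \vec{w}, \vec{Y} \gets \vec{y}]\varphi$. By assumption,

$$(M,\vec{u}) \models [\vec{X} \gets \vec{x}, \vec{W} \gets \vec{w}]\varphi. \tag{2.7}$$

If $Y \in \{Y_1, \ldots, Y_j\}$, then by choice of $y$, $(M,\vec{u}) \models [\vec{X} \gets \vec{x}', \vec{W} \gets \vec{w}](Y = y)$. Thus, by
Lemma 2.10.1,

$$(M,\vec{u}) \models [\vec{X} \gets \vec{x}, \vec{W} \gets \vec{w}](Y = y). \tag{2.8}$$

By (2.7), (2.8), and Lemma 2.10.2,

$$(M,\vec{u}) \models [\vec{X} \gets \vec{x}, \vec{W} \gets \vec{w}, Y_1 \gets y_1, \ldots, Y_j \gets y_j]\varphi. \tag{2.9}$$

Finally, since there is no causal path in $(M,\vec{u})$ from any of $Y_{j+1}, \ldots, Y_k$ to any variable in $\varphi$, it
follows from (2.9) and Lemma 2.10.1 that

$$(M,\vec{u}) \models [\vec{X} \gets \vec{x}, \vec{W} \gets \vec{w}, Y_1 \gets y_1, \ldots, Y_k \gets y_k]\varphi,$$

so AC2(b$^o$) holds, and $\vec{X} = \vec{x}$ is a cause of $\varphi$ in $(M,\vec{u})$ according to the original HP definition
with witness $(\vec{W} \cdot \vec{Y}, \vec{w} \cdot \vec{y}, \vec{x}')$. ∎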

Notes


The use of structural equations in models for causality goes back to the geneticist Sewall 
Wright [1921] (see [Goldberger 1972] for a discussion). Herb Simon [1953], who won a No- 
bel Prize in Economics (and, in addition, made fundamental contributions to Computer Sci- 
ence), and the econometrician Trygve Haavelmo [1943] also made use of structural equations. 
The formalization used here is due to Judea Pearl, who has been instrumental in pushing for 
this approach to causality. Pearl [2000] provides a detailed discussion of structural equations, 
their history, and their use. 


The assumption that the model is recursive is standard in the literature. The usual definition 
says that if the model is recursive, then there is a total order $\preceq$ of the variables such that $X$
affects $Y$ only if $X \preceq Y$. I have made two changes to the usual definition here. The first
is to allow the order to be partial; the second is to allow it to depend on the context. As the 
discussion of the rock-throwing example (Example 2.3.3) shows, the second assumption is 
useful and makes the definition more general; a good model of the rock-throwing example 
might well include contexts where Billy hits before Suzy if they both throw. Although it 
seems more natural to me to assume a partial order as I have done here, since the partial order 
can be read off the causal network, assuming that the order is partial is actually equivalent to 
assuming that it is total. Every total order is a partial order, and it is well known that every 
partial order can be extended (typically, in many ways) to a total order. If $\preceq'$ is a total order
that extends $\preceq$ and $X$ affects $Y$ only if $X \preceq Y$, then certainly $X$ affects $Y$ only if $X \preceq' Y$.
Allowing the partial order to depend on the context $\vec{u}$ is also nonstandard, but it is a rather
minimal change.
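
The step from a partial order to a total order can be illustrated with a topological sort. The sketch below is only an illustration: the `PARENTS` dictionary is one hypothetical encoding of the rock-throwing network (its edges are an assumption here), and `total_order` and `respects` are helper names introduced for this example. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical encoding of the rock-throwing network: each variable maps to
# the set of variables that its equation reads (its parents).
PARENTS = {
    "ST": set(),
    "BT": set(),
    "SH": {"ST"},
    "BH": {"BT", "SH"},
    "BS": {"SH", "BH"},
}


def total_order(parents):
    """Return one total order of the variables that extends the partial order."""
    return list(TopologicalSorter(parents).static_order())


def respects(order, parents):
    """Check that every parent comes before each of its children in `order`."""
    position = {var: i for i, var in enumerate(order)}
    return all(
        position[p] < position[child]
        for child, ps in parents.items()
        for p in ps
    )


order = total_order(PARENTS)
```

Several total orders typically extend the same partial order (for example, `ST` and `BT` may appear in either order), which is why the sketch checks the result with `respects` rather than comparing it to one fixed list.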


As I said in the notes to Chapter 1, the original HP definition was introduced by Halpern 
and Pearl in [Halpern and Pearl 2001]; it was updated in [Halpern and Pearl 2005a]; the 
modified definition was introduced in [Halpern 2015a]. These definitions were inspired by 
Pearl’s original notion of a causal beam [Pearl 1998]. (The definition of causal beam in [Pearl 
2000, Chapter 10] is a modification of the original definition that takes into account concerns 
raised in an early version of [Halpern and Pearl 2001].) Interestingly, according to the causal 
beam definition, A qualifies as an actual cause of B only if something like AC2(a$^m$) rather
than AC2(a) holds; otherwise it is called a contributory cause. The distinction between actual 
cause and contributory cause is lost in the original and updated HP definition. To some extent, 
it resurfaces in the modified HP definition; in some cases, what the causal beam definition 
would classify as a contributory cause but not an actual cause would be classified as part of a 
cause but not a cause according to the modified HP definition. 


Although I focus on (variants of) the HP definition of causality in this book, it is not the 
only definition of causality that is given in terms of structural equations. Besides Pearl’s 
causal beam notion, other definitions were given by (among others) Glymour and Wimberly 
[2007], Hall [2007], Hitchcock [2001, 2007], and Woodward [2003]. Problems have been 
pointed out with all these definitions; see [Halpern 2015a] for a discussion of some of them. 
As I mentioned in Chapter 1, there are also definitions of causality that use counterfactuals 
without using structural equations. The best known is due to Lewis [1973a, 2000]. Paul and 
Hall [2013] cover a number of approaches that attempt to reduce causality to counterfactuals. 

Proposition 2.2.2 and parts (a) and (b) of Theorem 2.2.3 are taken from [Halpern 2015a].
Part (d) of Theorem 2.2.3, the fact that with the original HP definition causes are always single 
conjuncts, was proved (independently) by Hopkins [2001] and Eiter and Lukasiewicz [2002]. 


Much of the material in Sections 2.1–2.3 is taken from [Halpern and Pearl 2005a]. The
formal definition of causal model is taken from [Halpern 2000]. Note that in the literature 
(especially the statistics literature), what I call here “variables” are called random variables. 


In the philosophy community, counterfactuals are typically defined in terms of “closest 
worlds” [Lewis 1973b; Stalnaker 1968]; a statement of the form “if A were the case then B 
would be true” is taken to be true if in the “closest world(s)” to the actual world where A is 
true, B is also true. The modification of equations may be given a simple “closest world” 
interpretation: the solution of the equations obtained by replacing the equation for Y with 
the equation Y = y, while leaving all other equations unaltered, can be viewed as the closest 
world to the actual world where Y = y. 


The asymmetry embodied in the structural equations (i.e., the fact that variables on the left- 
and right-hand sides of the equality sign are treated differently) can be understood in terms of 
closest worlds. Suppose that in the actual world, the arsonist does not drop a match, there is 
no lightning, and the forest does not burn down. If either the match or lightning suffices to 
start the fire, then in the closest world to the actual world where the arsonist drops a lit match, 
the forest burns down. However, it is not necessarily the case that the arsonist drops a match 
in the world closest to the actual world where the forest burns down. The formal connection 
between causal models and closest-world semantics for counterfactuals is somewhat subtle; 
see [Briggs 2012; Galles and Pearl 1998; Halpern 2013; Zhang 2013] for further discussion. 
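
The asymmetry described above can be made concrete with a small sketch. It assumes, for illustration only, a disjunctive forest-fire model with the equation FF = L ∨ MD, and it treats lightning and the dropped match as given directly by the context; the function `solve` and the variable names are hypothetical.

```python
def solve(lightning, match_dropped, interventions=None):
    """Solve the toy forest-fire model; an intervention replaces an equation."""
    interventions = interventions or {}
    values = {"L": lightning, "MD": match_dropped}
    for var in ("L", "MD"):
        if var in interventions:
            values[var] = interventions[var]
    # FF is on the left-hand side of its equation; L and MD are on the right.
    values["FF"] = interventions.get("FF", int(values["L"] or values["MD"]))
    return values


actual = solve(lightning=0, match_dropped=0)
drop_match = solve(0, 0, {"MD": 1})  # closest world where the match is dropped
force_fire = solve(0, 0, {"FF": 1})  # closest world where the forest burns down
```

Intervening on `MD` changes `FF`, because `MD` appears on the right-hand side of the equation for `FF`. Intervening on `FF` leaves `MD` at its actual value, because only the equation for `FF` is replaced.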


The question of whether there is some circularity in using causal models, where the struc- 
tural equations can arguably be viewed as encoding causal relationships, to provide a model 
of actual causality was discussed by Woodward [2003]. As he observed, if we intervene to 
set X to x, the intervention can be viewed as causing X to have value x. But we are not 
interested in defining a causal relationship between interventions and the values of variables 
intervened on. Rather, we are interested in defining a causal relation between the values of 
some variables and the values of others; for example, we want to say that the fact that
$X = x$ in the actual world is a cause of $Y = y$ in the actual world. As Woodward points out
(and I agree), the definition of actual causality depends (in part) on the fact that intervening
on $X$ by setting it to $x$ causes $X = x$, but there is no circularity in this dependence.


The causal networks that are used to describe the equations are similar in spirit to Bayesian 
networks, which have been widely used to represent and reason about (conditional) depen- 
dencies in probability distributions. Indeed, once we add probability to the picture as in 
Section 2.5, the connection is even closer. Historically, Judea Pearl [1988, 2000] introduced 
Bayesian networks for probabilistic reasoning and then applied them to causal reasoning. This 
similarity to Bayesian networks will be exploited in Chapter 5 to provide insight into when 
we can obtain compact representations of causal models. 


Spirtes, Glymour, and Scheines [1993] study causality by working directly with graphs, 
not structural equations. They consider type causality rather than actual causality; their focus 
is on (algorithms for) discovering causal structure and causal influence. This can be viewed as 
work that will need to be done to construct the structural equations that I am taking as given 
here. The importance of choosing the “right” set of variables comes out clearly in the Spirtes
et al. framework as well. 

The difference between conditions and causes in the law is discussed by Katz [1987] and 
Mackie [1965], among others. 

In the standard philosophical account, causality relates events, that is, for A to be a cause 
of B, A and B have to be events. But what counts as an event is open to dispute. There 
are different theories of events. (Casati and Varzi [2014] provide a recent overview of this 
work; Paul and Hall [2013, pp. 58-60] discuss its relevance to theories of causality.) A 
major issue is whether something not happening counts as an event [Paul and Hall 2013, pp. 
178-182]. (This issue is also related to that of whether omissions count as causes, discussed 
earlier.) The As and Bs that are the relata of causality in the HP definition are arguably closer 
to what philosophers have called true propositions (or facts) [Mellor 2004]. Elucidating the 
relationship between all these notions is well beyond the scope of this book. 

Although I have handled the rock-throwing example here by just adding two additional 
variables (BH and SH), as I said, this is only one of many ways to capture the fact that 
Suzy’s rock hit the bottle and Billy’s did not. The alternative of using variables indexed by 
time was considered in [Halpern and Pearl 2005a]. 

Examples 2.3.4 (double prevention), 2.3.5 (omissions as causes), and 2.4.1 (lack of tran- 
sitivity) are due to Hall; their descriptions are taken from (an early version of) [Hall 2004]. 
Again, these problems are well known and were mentioned in the literature much earlier. 
See Chapter 3 for more discussion of causation by omission and the notes to that chapter for 
references. 

Example 2.3.6 is due to Paul [2000]. In [Halpern and Pearl 2005a], it was modeled a little 
differently, using variables LT (for train is on the left track) and RT (for train is on the right 
track). Thus, in the actual context, where the engineer flips the switch and the train goes 
down the right track, we have F = 1, LT = 0, and RT = 1. With this choice of variables,
all three variants of the HP definition declare F = 1 a cause of A = 1 (we can fix LT at
its actual value of 0 to get the counterfactual dependence of A on F). This problem can be
dealt with using normality considerations (see the last paragraph of Example 3.4.3). In any 
case, as Hitchcock and I pointed out [Halpern and Hitchcock 2010], this choice of variables is
somewhat problematic because LT and RT are not independent. What would it mean to set 
both LT and RT to 1, for example? That the train is simultaneously going down both tracks? 
Using LB and RB, as done in [Halpern and Hitchcock 2010], captures the essence of Hall’s
story while avoiding this problem. The issue of being able to set variables independently is 
discussed in more detail in Section 4.6. 

Example 2.3.8 is taken from [Halpern 2015a]. 

O’Connor [2012] says that, each year, an estimated 4,000 cases of “retained surgical items” 
are reported in the United States and discusses a lawsuit resulting from one such case. 

Example 2.3.7 is due to Bas van Fraassen and was introduced by Schaffer [2000b] under 
the name trumping preemption. Schaffer (and Lewis [2000]) claimed that, because orders 
from higher-ranking officers trump those of lower-ranking officers, the captain is a cause of 
the charge and the sergeant is not. The case is not so clearcut to me; the analysis in the main 
text shows just how sensitive the determination of causality is to details of the model. Halpern 
and Hitchcock [2010] make the point that if C and S have range {0, 1} (and the only variables
in the causal model are C, S, and P), then there is no way to capture the fact that in the case
of conflicting orders, the private obeys the captain. The model of the problem that captures 
trumping with the extra variable SE, described in Figure 2.7, was introduced in [Halpern and 
Pearl 2005a]. 


Example 2.4.2 is due to McDermott [1995], who also gives other examples of lack of 
transitivity, including Example 3.5.2, discussed in Chapter 3. 


The question of the transitivity of causality has been the subject of much debate. As Paul 
and Hall [2013] say, “Causality seems to be transitive. If C causes D and D causes E, then C 
thereby causes E.” Indeed, Paul and Hall [2013, p. 215] suggest that “preserving transitivity
is a basic desideratum for an adequate analysis of causation”. Lewis [1973a, 2000] imposes 
transitivity in his definition of causality by taking causality to be the transitive closure (“ancestral”,
in his terminology) of a one-step causal dependence relation. But numerous examples
have been presented that cast doubt on transitivity; Richard Scheines [personal communica- 
tion, 2013] suggested the homeostasis example. Paul and Hall [2013] give a sequence of such 
counterexamples, concluding that “What’s needed is a more developed story, according to 
which the inference from ‘C causes D’ and ‘D causes E’ to ‘C causes E’ is safe provided
such-and-such conditions obtain—where these conditions can typically be assumed to obtain, 
except perhaps in odd cases ...”. Hitchcock [2001] argues that the cases that create problems 
for transitivity fail to involve appropriate “causal routes” between the relevant events. Propo- 
sitions 2.4.3, 2.4.4, and 2.4.6 can be viewed as providing other sets of conditions for when we 
can safely assume transitivity. These results are proved in [Halpern 2015b], from where much 
of the discussion in Section 2.4 is taken. Since these definitions apply only to but-for causes, 
which all counterfactual-based accounts (not just the HP accounts) agree should be causes, 
the results should hold for almost all definitions that are based on counterfactual dependence 
and structural equations. 


As I said, there are a number of probabilistic theories of causation that identify causality 
with raising probability. For example, Eells and Sober [1983] take $C$ to be a cause of $E$
if $\Pr(E \mid K_i \wedge C) > \Pr(E \mid K_i \wedge \neg C)$ for all appropriate maximal specifications $K_i$ of
causally relevant background factors, provided that these conditionals are defined (where they 
independently define what it means for a background factor to be causally relevant); see also 
[Eells 1991]. Interestingly, Eells and Sober [1983] also give sufficient conditions for causality 
to be transitive according to their definition. This definition does not involve counterfactuals 
(so the results of Section 2.4 do not apply). 
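
As a rough illustration of the probability-raising condition, the sketch below checks the inequality for every background specification on a table of counts. The counts, the labels `k1` and `k2`, and the helper names are all hypothetical, and the sketch simply skips a specification when one of the conditional probabilities is undefined; it is not Eells and Sober's own formalization.

```python
from fractions import Fraction

# Hypothetical joint counts indexed by (K, C, E); K labels a background specification.
COUNTS = {
    ("k1", True, True): 30, ("k1", True, False): 10,
    ("k1", False, True): 10, ("k1", False, False): 30,
    ("k2", True, True): 8, ("k2", True, False): 2,
    ("k2", False, True): 5, ("k2", False, False): 5,
}


def pr_e_given(k, c):
    """Pr(E | K = k and C = c), or None if the conditional is undefined."""
    yes = COUNTS.get((k, c, True), 0)
    no = COUNTS.get((k, c, False), 0)
    if yes + no == 0:
        return None
    return Fraction(yes, yes + no)


def raises_probability_everywhere(background):
    """Check Pr(E | K_i, C) > Pr(E | K_i, not C) wherever both are defined."""
    for k in background:
        with_c, without_c = pr_e_given(k, True), pr_e_given(k, False)
        if with_c is None or without_c is None:
            continue  # the condition is imposed only when both conditionals are defined
        if not with_c > without_c:
            return False
    return True


c_raises_e = raises_probability_everywhere(["k1", "k2"])
```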


There are also definitions of probabilistic causality in this spirit that do involve counterfac- 
tuals. For example, Lewis [1973a] defines a probabilistic version of his notion of dependency, 
by taking A to causally depend on B if and only if, had B not occurred, the chance of A’s oc- 
curring would be much less than its actual chance; he then takes causality to be the transitive 
closure of this causal dependence relation. There are several other ways to define causality in 
this spirit; see Fitelson and Hitchcock [2011] for an overview. In any case, as Example 2.5.2 
shows, no matter how this is made precise, defining causality this way is problematic. Perhaps 
the best-known example of the problem is due to Rosen and is presented by Suppes [1970]: A 
golfer lines up to drive her ball, but her swing is off and she slices it badly. The slice clearly
decreases the probability of a hole-in-one. But, as it happens, the ball bounces off a tree trunk 
at just the right angle that, in fact, the golfer gets a hole-in-one. We want to say that the slice 
caused the hole-in-one (and the definition given here would say that), although slicing the ball
lowers the probability of a hole-in-one. 

The difference between the probability of rain conditional on a low barometer reading and 
the probability that it rains as a result of intervening to set the barometer reading to low has 
been called by Pearl [2000] the difference between seeing and doing. 
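
The difference between seeing and doing can be computed exactly in a toy model with a probability on contexts. Everything in the sketch below is an assumption made for illustration: the single context variable (low atmospheric pressure), its probability of 3/10, and the equations stating that both the barometer reading and rain depend only on the pressure.

```python
from fractions import Fraction

# Hypothetical probability on contexts: u = 1 means atmospheric pressure is low.
PR_CONTEXT = {1: Fraction(3, 10), 0: Fraction(7, 10)}


def solve(u, do_barometer=None):
    """Solve the toy model in context u, optionally intervening on the barometer."""
    pressure_low = u
    barometer_low = pressure_low if do_barometer is None else do_barometer
    rain = pressure_low  # rain depends on the pressure, not on the barometer
    return {"barometer_low": barometer_low, "rain": rain}


def pr(event, given=lambda v: True, do_barometer=None):
    """Probability of `event` conditional on `given`, after an optional intervention."""
    num = den = Fraction(0)
    for u, p in PR_CONTEXT.items():
        v = solve(u, do_barometer)
        if given(v):
            den += p
            if event(v):
                num += p
    return num / den


# Seeing: condition on observing a low reading.
seeing = pr(lambda v: v["rain"] == 1, given=lambda v: v["barometer_low"] == 1)
# Doing: intervene to set the reading to low.
doing = pr(lambda v: v["rain"] == 1, do_barometer=1)
```

In this sketch, observing a low reading makes rain certain, whereas setting the reading to low leaves the probability of rain at its prior value, since the intervention cuts the barometer off from the pressure.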

Examples 2.5.1, 2.5.2, and 2.5.3 are slight modifications of examples in [Paul and Hall 2013];
the original version of Example 2.5.3 is due to Frick [2009]. As I
said, the idea of “pulling out the probability” is standard in computer science; see [Halpern 
and Tuttle 1993] for further discussion and references. Northcott [2010] defends the view 
that using deterministic models may capture important features of the psychology of causal 
judgment; he also provides a definition of probabilistic causality. Fenton-Glynn [2016] pro- 
vides a definition of probabilistic causality in the spirit of the HP definitions; unfortunately, 
he does not consider how his approach fares on the examples in the spirit of those discussed 
here. Much of the material in Section 2.5 is taken from [Halpern 2014b]. 

Example 2.5.4 is in the spirit of an example considered by Hitchcock [2004]. In his ex- 
ample, both Juan and Jennifer push a source of polarized photons, which greatly increases 
the probability that a photon emitted by the source will be transmitted by a polarizer. In fact, 
the polarizer does transmit a photon. Hitchcock says, “It seems clearly wrongheaded to ask 
whether the transmission was really due to Juan’s push rather than Jennifer’s push. Rather 
...each push increased the probability of the effect (the transmission of the photon), and then 
it was simply a matter of chance. There are no causal facts of the matter extending beyond 
the probabilistic contribution made by each. This is a very simple case in which our intuitions 
are not clouded by tacit deterministic assumptions.” 

Hitchcock’s example involves events at the microscopic level, where arguably we cannot 
pull out the probability. But for macroscopic events, where we can pull out the probability, as 
is done in Section 2.5, I believe that we can make sense out of the analogous questions, and 
deterministic assumptions are not at all unreasonable. 

Balke and Pearl [1994] give a general discussion of how to evaluate probabilistic queries 
in a causal model where there is a probability on contexts. 

Representations of uncertainty other than probability, including the use of sets of probability
measures, are discussed in [Halpern 2003].

Lewis [1986b, Postscript C] discussed sensitive causation and emphasized that in long 
causal chains, causality was quite sensitive. Woodward [2006] goes into these issues in much 
more detail and makes the point that people’s causality judgments are clearly influenced by 
how sensitive the causality ascription is to changes in various other factors. He points out that 
one reason some people tend to view double prevention and absence of causes as not quite 
“first-class” causes might be that these are typically quite sensitive causes; in the language of 
Section 2.6, they are sufficient causes with only low probability. 

Pearl and I [2005a] considered what we called strong causality, which is intended to cap- 
ture some of the same intuitions behind sufficient causality. Specifically, we extended the 
updated definition by adding the following clause: 


AC2(c). $(M,\vec{u}) \models [\vec{X} \gets \vec{x}, \vec{W} \gets \vec{w}']\varphi$ for all settings $\vec{w}'$ of $\vec{W}$.


Thus, instead of requiring $[\vec{X} \gets \vec{x}]\varphi$ to hold in all contexts, some of the same effect is
achieved by requiring that $\varphi$ holds no matter what values of $\vec{W}$ are considered. The definition
given here seems somewhat closer to the intuitions and allows the consideration of probabilistic
sufficient causality by putting a probability on contexts. This, in turn, makes it possible to
bring out the connections between sufficient causality, normality, and blame. 

Although Pearl [1999, 2000] does not define the notion of sufficient causality, he does talk 
about the probability of sufficiency, that is, the probability that $X = x$ is a sufficient cause
of $Y = y$. Roughly speaking, this is the probability that setting $X$ to $x$ results in $Y = y$,
conditional on $X \neq x$ and $Y \neq y$. Ignoring the conditioning, this is very close in spirit
to the definition given here. Datta et al. [2015] considered a notion close to the notion of 
sufficient causality defined here and pointed out that it could be used to distinguish joint and 
independent causality. Honoré [2010] discusses the distinction between joint and independent 
causation in the context of the law. Gerstenberg et al. [2015] discuss the impact of sufficient 
causality on people’s causality judgments. 

The definition of causality in nonrecursive causal models is taken from the appendix of 
[Halpern and Pearl 2005a]. Strotz and Wold [1960] discuss recursive and nonrecursive models 
as used in econometrics. They argue that our intuitive view of causality really makes sense 
only in recursive models, and that once time is taken into account, a nonrecursive model can 
typically be viewed as recursive. 

Example 2.8.1, which motivated AC2(b$^u$), is due to Hopkins and Pearl [2003]. Example 2.8.2
is taken from [Halpern 2008]. Hall [2007] gives a definition of causality that he calls
the H-account that requires (2.1) to hold (but only for $\vec{W}$, not for all subsets $\vec{W}'$ of $\vec{W}$).
Example 2.8.3, which is taken from [Halpern and Pearl 2005a], shows that the H-account would
not declare $V_1 = 1$ a cause of $P = 1$, which seems problematic. This example also causes
related problems for Pearl’s [1998, 2000] causal beam definition of causality. 

Hall [2007], Hitchcock [2001], and Woodward [2003] are examples of accounts of causal- 
ity that explicitly appeal to causal paths. In [Halpern and Pearl 2005a, Proposition A.2], it is 
claimed that all the variables in $\vec{Z}$ can be taken to lie on a causal path from a variable in $\vec{X}$ to
$\varphi$. As Example 2.9.1 shows, this is not true in general. In fact, the argument used in the proof
of [Halpern and Pearl 2005a, Proposition A.2] does show that with AC2(b$^o$), all the variables
in $\vec{Z}$ can be taken to lie on a causal path from a variable in $\vec{X}$ to $\varphi$ (and is essentially the
argument used in the proof of Proposition 2.9.2). The definition of causality was changed from
using AC2(b$^o$) to AC2(b$^u$) in the course of writing [Halpern and Pearl 2005a]; Proposition
A.2 was not updated accordingly. 

Chapter 3


Graded Causation and Normality 


Perhaps the feelings that we experience when we are in love represent a normal 
state.


Anton Chekhov 


In a number of the examples presented in Chapter 2, the HP definition gave counterintuitive 
ascriptions of causality. I suggested then that taking normality into account would solve 
these problems. In this chapter, I fill in the details of this suggestion and show that taking 
normality into account solves these and many other problems, in what seems to me a natural 
way. Although I apply ideas of normality to the HP definition, these ideas should apply 
equally well to a number of other approaches to defining causality. 

This approach is based on the observation that norms can affect counterfactual reasoning. 
To repeat the Kahneman and Miller quote given in Chapter 1, “[in determining causality], an 
event is more likely to be undone by altering exceptional than routine aspects of the causal 
chain that led to it”. In the next section, I give a short discussion of issues of defaults, typi- 
cality, and normality. I then show how the HP definition can be extended to take normality 
into account. The chapter concludes by showing how doing this deals with the problematic 
examples from Chapter 2 as well as other concerns. 


3.1 Defaults, Typicality, and Normality 


The extended account of actual causation incorporates the concepts of defaults, typicality, and 
normality. These are related, although somewhat different notions: 


• A default is an assumption about what happens, or what is the case, when no additional
information is given. For example, we might have as a default assumption that birds 
fly. If we are told that Tweety is a bird and given no further information about Tweety, 
then it is natural to infer that Tweety flies. Such inferences are defeasible: they can be 
overridden by further information. If we are additionally told that Tweety is a penguin, 
we retract our conclusion that Tweety flies. 


• To say that birds typically fly is to say not merely that flight is statistically prevalent
among birds, but also that flight is characteristic of the type “bird”. Although not all 
birds fly, flying is something that we do characteristically associate with birds. 


• The word normal is interestingly ambiguous. It seems to have both a descriptive and a
prescriptive dimension. To say that something is normal in the descriptive sense is to 
say that it is the statistical mode or mean (or close to it). However, we often use the 
shorter form norm in a more prescriptive sense. To conform with a norm is to follow a 
prescriptive rule. Prescriptive norms can take many forms. Some norms are moral: to 
violate them would be morally wrong. For example, many people believe that there are 
situations in which it would be wrong to lie, even if there are no laws or explicit rules 
forbidding this behavior. Laws are another kind of norm, adopted for the regulation 
of societies. Policies that are adopted by institutions can also be norms. For instance, 
a company may have a policy that employees are not allowed to be absent from work 
unless they have a note from their doctor. There can also be norms of proper functioning 
in machines or organisms. There are specific ways that human hearts and car engines 
are supposed to work, where “supposed” here has not merely an epistemic force, but 
a kind of normative force. Of course, a car engine that does not work properly is not 
guilty of a moral wrong, but there is nonetheless a sense in which it fails to live up to a 
certain kind of standard. 


Although this might seem like a heterogeneous mix of concepts, they are intertwined in a 
number of ways. For example, default inferences are successful to the extent that the default 
is normal in the statistical sense. Adopting the default assumption that a bird can fly facili- 
tates successful inferences in part because most birds are able to fly. Similarly, we classify 
objects into types in part to group objects into classes, most of whose members share cer- 
tain features. Thus, the type “bird” is useful partly because there is a suite of characteristics 
shared by most birds, including the ability to fly. The relationship between the statistical and 
prescriptive senses of “normal” is more subtle. It is, of course, quite possible for a majority 
of individuals to act in violation of a given moral or legal norm. Nonetheless, it seems that 
the different kinds of norms often serve as heuristic substitutes for one another. For example, 
there are well-known experiments showing that we are often poor at explicit statistical rea- 
soning, employing instead a variety of heuristics. Rather than tracking statistics about how 
many individuals behave in a certain way, we might well reason about how people should 
behave in certain situations. The idea is that we use a script or a template for reasoning about 
certain situations, rather than actual statistics. Prescriptive norms of various sorts can play a 
role in the construction of such scripts. It is true, of course, that conflation of the different 
sorts of norm can sometimes have harmful consequences. Less than a hundred years ago, for 
example, left-handed students were often forced to learn to write with their right hands. In 
retrospect, this looks like an obviously fallacious inference from the premise that the majority 
of people write with their right hand to the conclusion that it is somehow wrong to write with 
the left hand. But the ease with which such an inference was made illustrates the extent to 
which we find it natural to glide between the different senses of “norm”. 

When Kahneman and Miller say that “an event is more likely to be undone by altering
exceptional than routine aspects of the causal chain that led to it”, they are really talking about
how we construct counterfactual situations. They are also assuming a close connection between
counterfactuals and causality, since they are effectively claiming that the counterfactual
situation we consider affects our causality judgment. The fact that norms influence causal 
judgment has been confirmed experimentally (although the experiments do not show that the 
influence goes via counterfactual judgments). To take just one example (see the notes at the 
end of the chapter for others), consider the following story from an experiment by Knobe and 
Fraser: 


The receptionist in the philosophy department keeps her desk stocked with pens. 
The administrative assistants are allowed to take the pens, but faculty members 
are supposed to buy their own. The administrative assistants typically do take the 
pens. Unfortunately, so do the faculty members. The receptionist has repeatedly 
emailed them reminders that only administrative assistants are allowed to take 
the pens. On Monday morning, one of the administrative assistants encounters 
Professor Smith walking past the receptionist’s desk. Both take pens. Later that 
day, the receptionist needs to take an important message ... but she has a problem.
There are no pens left on her desk. 


People are much more likely to view Prof. Smith as the cause of there not being pens than 
the administrative assistant. The HP definition does not distinguish them, calling them both 
causes. The extended HP definition given in the next section does allow us to distinguish 
them. Since the HP definition is based on counterfactuals, the key step will be to have the 
choice of witness be affected by normality considerations. Although the assumption that 
normality considerations affect counterfactual judgments does not follow immediately from 
the experimental evidence, it seems like a reasonable inference, given the tight connection 
between causality and counterfactuals. 


3.2 Extended Causal Models 


I add normality to causal models by assuming that an agent has, in addition to a theory of 
causal structure (as modeled by the structural equations), a theory of “normality” or “typ- 
icality”. This theory includes statements of the form “typically, people do not put poison 
in coffee”. There are many ways of giving semantics to such typicality statements (see the 
notes at the end of the chapter). All of them give ways to compare the relative “normality” of 
some objects. For now, I take those objects to be the worlds in a causal model M, where a 
world is an assignment of values to the endogenous variables; that is, a complete description 
of the values of the endogenous variables in M. I discuss after Example 3.2.6 why a world 
here is taken to be an assignment only to the endogenous variables and not an assignment 
to both exogenous and endogenous variables, as in the complete assignments considered in 
Section 2.7. A complete assignment is essentially equivalent to the standard philosophers’ 
notion of a world as a maximal state of affairs, so my usage of “world” here is not quite the 
same as the standard philosophers’ usage. In Section 3.5, I discuss an alternative approach, 
which puts the normality ordering on contexts, much in the spirit of putting a probability on 
contexts as was done in Section 2.5. Since (in a recursive model) a context determines the
values of all endogenous variables, a context can be identified with a complete assignment,
and thus with the philosophers’ notion of world. But for now I stick with putting a normality 
ordering on worlds as I have defined them. 


Most approaches to normality/typicality implicitly assume a total normality ordering on 
worlds: given a pair of worlds w and w’, either w is more normal than w’, w is less normal 
than w’, or the two worlds are equally normal. The approach used here has the advantage of 
allowing an extra possibility: w and w’ are incomparable as far as normality goes. (Techni- 
cally, this means that the normality ordering is partial, rather than total.) 


We can think of a world as a complete description of a situation given the language deter- 
mined by the set of endogenous variables. Thus, a world in the forest-fire example might be 
one where U = (0,1), L = 0, MD = 1, and FF = 0; the match is dropped, there is no 
lightning, and no forest fire. As this example shows, a “world” does not have to satisfy the 
equations of the causal model. 


For ease of exposition, I make a somewhat arbitrary stipulation regarding terminology. 
In what follows, I use “default” and “typical” when talking about individual variables or 
equations. For example, I might say that the default value of a variable is zero or that one 
variable typically depends on another in a certain way. I use “normal” when talking about 
worlds. Thus, I say that one world is more normal than another. I assume that it is typical 
for endogenous variables to be related to one another in the way described by the structural 
equations of a model, unless there is some specific reason to think otherwise. This ensures that 
the downstream consequences of what is typical are themselves typical (again, in the absence 
of any specific reason to think otherwise). 


Other than this mild restriction, I make no assumptions on what gets counted as typical. 
This introduces a subjective element into a causal model. Different people might well use 
different normality orderings, perhaps in part due to focusing on different aspects of normality 
(descriptive normality, typicality, prescriptive normality, etc.). This might bother those who 
think of causality as an objective feature of the world. In this section, I just focus on the 
definitions and give examples where the normality ordering seems uncontroversial; I return to 
this issue in Chapter 4. I will point out for now that there are already subjective features in a 
model: the choice of variables and which variables are taken to be exogenous and endogenous 
is also determined by the modeler. 


I now extend causal models to provide a formal model of normality. I do so by assuming
that there is a partial preorder $\succeq$ over worlds. Intuitively, $s \succeq s'$ means that $s$ is at least
as normal as $s'$. Recall from Section 2.1 that a partial order on a set $S$ is a reflexive,
anti-symmetric, and transitive relation on $S$. A partial preorder is reflexive and transitive but not
necessarily anti-symmetric; this means that we can have two worlds $s$ and $s'$ such that $s \succeq s'$
and $s' \succeq s$ without having $s = s'$. That is, we can have two distinct worlds where each is
at least as normal as the other. By way of contrast, the standard ordering $\geq$ on the natural
numbers is anti-symmetric; if $x \geq y$ and $y \geq x$, then $x = y$. (The requirement of
anti-symmetry is what distinguishes a (total or partial) order from a preorder.) I write $s \succ s'$ if
$s \succeq s'$ and it is not the case that $s' \succeq s$, and $s \equiv s'$ if $s \succeq s'$ and $s' \succeq s$. Thus, $s \succ s'$ means
that $s$ is strictly more normal than $s'$, whereas $s \equiv s'$ means that $s$ and $s'$ are equally normal.
Note that I am not assuming that $\succeq$ is total; it is quite possible that there are two worlds $s$ and
$s'$ that are incomparable in terms of normality. The fact that $s$ and $s'$ are incomparable does
not mean that $s$ and $s'$ are equally normal. We can interpret it as saying that the agent is not
prepared to declare either $s$ or $s'$ as more normal than the other and also not prepared to say
that they are equally normal; they simply cannot be compared in terms of normality.
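
The three derived relations (strictly more normal, equally normal, incomparable) can be illustrated with a small sketch. It is not part of the formal development: worlds are stood in for by arbitrary labels, the preorder is stored as a set of pairs, and the names `make_preorder`, `strictly_more_normal`, `equally_normal`, and `incomparable` are hypothetical helpers introduced only for this illustration.

```python
from itertools import product


def make_preorder(worlds, base_pairs):
    """Return the reflexive-transitive closure of base_pairs.

    A pair (s, t) in the result is read as "s is at least as normal as t".
    """
    geq = {(w, w) for w in worlds} | set(base_pairs)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so that adding pairs is safe.
        for (a, b), (c, d) in product(list(geq), repeat=2):
            if b == c and (a, d) not in geq:
                geq.add((a, d))
                changed = True
    return geq


def strictly_more_normal(geq, s, t):
    return (s, t) in geq and (t, s) not in geq


def equally_normal(geq, s, t):
    return (s, t) in geq and (t, s) in geq


def incomparable(geq, s, t):
    return (s, t) not in geq and (t, s) not in geq


# Four placeholder worlds: a and b are equally normal, both are more
# normal than c, and d is not compared with anything but itself.
geq = make_preorder(["a", "b", "c", "d"], [("a", "b"), ("b", "a"), ("a", "c")])
print(equally_normal(geq, "a", "b"))
print(strictly_more_normal(geq, "b", "c"))
print(incomparable(geq, "a", "d"))
```

Representing the ordering as an explicit set of pairs, rather than as a numeric rank, is what leaves room for incomparable worlds; a rank function would force the ordering to be total.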

Recall that normality can be interpreted in terms of frequency. Thus, we say that birds 
typically fly in part because the fraction of flying birds is much higher than the fraction of 
non-flying birds. We might consider a world where a bird can fly to be more normal than 
one where a bird cannot fly. If we were to put a probability on worlds, then a world where a 
bird flies might well have a greater probability than one where a bird does not fly. Although 
we could interpret $s \succ s'$ as meaning “$s$ is more probable than $s'$”, this interpretation is not
always appropriate. For one thing, interpreting $\succeq$ in terms of probability would make $\succeq$ a total
order: all worlds would be comparable. There are advantages to allowing $\succeq$ to be partial (see
the notes at the end of the chapter for more discussion of this point). More importantly, even
when thinking probabilistically, I view $s \succ s'$ as saying that $s$ is much more probable than $s'$.
I return to this point below. 

Take an extended causal model to be a tuple $M = (\mathcal{S}, \mathcal{F}, \succeq)$, where $(\mathcal{S}, \mathcal{F})$ is a causal
model and $\succeq$ is a partial preorder on worlds, which can be used to compare how normal
different worlds are. In particular, $\succeq$ can be used to compare the actual world to a world
where some interventions have been made. Which world is the actual world? That depends
on the context. In a recursive extended causal model, a context $\vec{u}$ determines a world denoted
$s_{\vec{u}}$. We can think of the world $s_{\vec{u}}$ as the actual world in context $\vec{u}$; it is the world that would
occur given the setting of the exogenous variables in $\vec{u}$, provided that there are no external
interventions.

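I now modify the HP definitions slightly to take the ordering of worlds into account. I start
with the original and updated definitions. In this case, define $\vec{X} = \vec{x}$ to be a cause of $\varphi$ in an
extended model $M$ and context $\vec{u}$ if $\vec{X} = \vec{x}$ is a cause of $\varphi$ according to Definition 2.2.1, except
that in AC2(a), a clause is added requiring that $s_{\vec{X} = \vec{x}', \vec{W} = \vec{w}, \vec{u}} \succeq s_{\vec{u}}$, where $s_{\vec{X} = \vec{x}', \vec{W} = \vec{w}, \vec{u}}$ is
the world that results by setting $\vec{X}$ to $\vec{x}'$ and $\vec{W}$ to $\vec{w}$ in context $\vec{u}$. This just says that we
require the witness world to be at least as normal as the actual world. Here is the extended
AC2(a), which I denote AC2*(a).


AC2*(a). There is a partition of $\mathcal{V}$ (the set of endogenous variables) into two disjoint subsets
$\vec{Z}$ and $\vec{W}$ with $\vec{X} \subseteq \vec{Z}$ and a setting $\vec{x}'$ and $\vec{w}$ of the variables in $\vec{X}$ and $\vec{W}$, respectively,
such that $s_{\vec{X} = \vec{x}', \vec{W} = \vec{w}, \vec{u}} \succeq s_{\vec{u}}$ and $(M, \vec{u}) \models [\vec{X} \leftarrow \vec{x}', \vec{W} \leftarrow \vec{w}]\neg\varphi$.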


This can be viewed as a formalization of Kahneman and Miller’s observation that we tend 
to consider only possibilities that result from altering atypical features of a world to make 
them more typical, rather than vice versa. In this formulation, worlds that result from inter- 
ventions on the actual world “come into play” in AC2*(a) only if they are at least as normal 
as the actual world. If we are using the modified HP definition, then AC2(a$^m$) is extended to
AC2*(a$^m$) in exactly the same way; the only difference between AC2*(a) and AC2*(a$^m$) is
that, in the latter (just as in AC2(a$^m$)), $\vec{w}$ is required to consist of the values of the variables
in $\vec{W}$ in the actual context (i.e., $(M, \vec{u}) \models \vec{W} = \vec{w}$).

Issues of normality are, by and large, orthogonal to the differences between the variants of 
the HP definition. For ease of exposition, in the remainder of this chapter, I use the updated 
HP definition and AC2*(a), unless noted otherwise. Using AC2*(a) rather than AC2(a), with
reasonable assumptions about normality, takes care of the problems pointed out in Section 2,
although it raises other issues. 


Example 3.2.1 As I observed, one way to avoid calling the presence of oxygen a cause of 
the forest burning is simply not to include a variable for oxygen in the model or to take the 
presence of oxygen to be exogenous. But suppose someone wanted to consider oxygen to be 
an endogenous variable. Then as long as a world where there is no oxygen is more abnormal 
than a world where there is oxygen, the presence of oxygen is not a cause of the forest fire; 
AC2*(a) fails.

However, suppose that the fire occurs when a lit match is dropped in a special chamber in 
a scientific laboratory that is normally voided of oxygen. Then there would presumably be 
a different normality ordering. Now the presence of oxygen is atypical, and the witness to 
oxygen being a cause is as normal as (or at least not strictly less normal than) the witness to 
the match being a cause. In this case, I suspect most people would call the presence of oxygen 
a cause of the fire.
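
The contrast between the ordinary forest and the oxygen-free chamber can be sketched in code. This is only an illustration under stated assumptions: the toy model has just three binary endogenous variables (`O` for oxygen, `MD` for a dropped match, and `FF = O and MD`), only but-for causes are checked (so the set of other variables held fixed is empty), and one world counts as at least as normal as another when its set of atypical facts is a subset of the other's. That subset ordering is one convenient partial preorder chosen for the sketch, not one prescribed by the text, and all function names are hypothetical.

```python
def solve(context, interventions):
    """World reached in the toy fire model after the given interventions."""
    world = {}
    for var in ("O", "MD"):
        world[var] = interventions.get(var, context[var])
    # Structural equation: fire needs both oxygen and a dropped match.
    world["FF"] = interventions.get("FF", min(world["O"], world["MD"]))
    return world


def atypical(world, atypical_facts):
    """The atypical facts (variable, value) that hold in this world."""
    return {(v, x) for (v, x) in atypical_facts if world[v] == x}


def at_least_as_normal(s, t, atypical_facts):
    # Assumed ordering: s is at least as normal as t if every atypical
    # fact true in s is also true in t.
    return atypical(s, atypical_facts) <= atypical(t, atypical_facts)


def is_butfor_cause_with_normality(var, context, atypical_facts):
    """But-for test for FF = 1 plus the extra normality clause on the witness."""
    actual = solve(context, {})
    witness = solve(context, {var: 1 - actual[var]})
    flips_outcome = actual["FF"] == 1 and witness["FF"] == 0
    return flips_outcome and at_least_as_normal(witness, actual, atypical_facts)


context = {"O": 1, "MD": 1}
forest = {("O", 0), ("MD", 1)}    # no oxygen and a dropped match are atypical
chamber = {("O", 1), ("MD", 1)}   # in the chamber, oxygen itself is atypical

print(is_butfor_cause_with_normality("O", context, forest))
print(is_butfor_cause_with_normality("MD", context, forest))
print(is_butfor_cause_with_normality("O", context, chamber))
```

With the forest ordering, removing oxygen leads to a strictly less normal world, so the oxygen candidate is rejected while the match is accepted; with the chamber ordering, removing oxygen leads to a more normal world, so oxygen passes the test.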


Example 3.2.2 Recall Example 2.3.5, where there are potentially many doctors who could 
treat Billy besides his regular doctor. Although it seems reasonable to call Billy’s regular 
doctor a cause of him being sick on Tuesday if the doctor fails to treat him, it does not seem 
reasonable to call all the other doctors causes of him being sick. This is easy to deal with 
if we take normality into account. It seems perfectly reasonable to take the world where 
Billy’s doctor treats Billy to be at least as normal as the world where Billy’s doctor does not 
treat Billy; it also seems reasonable to take the worlds where the other doctors treat Billy 
as being less normal than the world where they do not treat Billy. With these assumptions, 
using AC2*(a), Billy’s doctor not treating Billy is a cause of him being sick, whereas the 
other doctors not treating him are not causes. Thus, using normality allows us to capture 
intuitions that people seem to have without appealing to probabilistic sufficient causality and 
thus requiring a probability on contexts (see, for comparison, the discussion in Section 2.6). 

What justifies a normality ordering where it is more normal for Billy’s doctor to treat him 
than another random doctor? All the interpretations discussed in Section 3.1 lead to this 
judgment. 


• Statistically, patients are far more likely to be treated by their own doctors than by a
random doctor.

• We assume that, by default, a patient will be treated by his doctor.

• A patient’s doctor does typically treat the patient, in the sense that this is a characteristic
of being someone’s doctor.

• It is a norm that a patient’s doctor will treat the patient.


There are philosophers who claim that an omission—the event of an agent not doing 
something—can never count as a cause. Although I do not share this viewpoint, it can be 
accommodated in this framework by taking not acting to always be more normal than acting. 
That is, given two worlds $s$ and $s'$ that agree in all respects (i.e., in the values of all variables)
except that in $s$ some agents do not perform an act that they do perform in $s'$, we would have
$s \succ s'$. (As an aside, since $\succeq$ is allowed to be partial, if different agents perform acts in $s$ and
$s'$, then we can just take $s$ and $s'$ to be incomparable.)

We can equally well accommodate a viewpoint that allows all omissions to count as causes. 
This is essentially the viewpoint of the basic HP definition, and we can recover the original 
HP approach (without normality), and hence drop the distinctions between omissions and 
other potential causes, by simply taking all worlds to be equally normal. (Indeed, by doing 
this, we are effectively removing normality from the picture and are back at the original HP 
definition.) 

Of course, the fact that we can accommodate all these viewpoints by choosing an appropri- 
ate normality ordering raises the concern that we can reverse-engineer practically any verdict 
about causality in this way. Although it is true that adding normality to the model gives a 
modeler a great deal of flexibility, the key word here is “appropriate”. A modeler will need 
to argue that the normality ordering that she has chosen is reasonable and appropriate to cap- 
ture the intuitions we have about the situation being modeled. Although there can certainly 
be some disagreement about what counts as “reasonable”, the modeler does not have a blank 
check here. I return to this issue in Section 4.7.


These examples already show that, once we take normality into account, causality ascrip- 
tions can change significantly. One consequence of this is that but-for causes may no longer 
be causes (according to all of the variants of the HP definition). The presence of oxygen is 
clearly a but-for cause of fire—if there is no oxygen, there cannot be a fire. But in situations 
where we take the presence of oxygen for granted, we do not want to call the presence of 
oxygen a cause. 

Moreover, although it is still the case that a cause according to the updated HP definition 
is a cause according to the original HP definition, the other parts of Theorem 2.2.3 fail once 
we take normality into account. 


Example 3.2.3 Suppose that two people vote in favor of a particular outcome and it takes 
two votes to win. Each is a but-for cause of the outcome, so all the definitions declare each 
voter to be a cause. But now suppose that we take the normality ordering to be such that it 
is abnormal for the two voters to vote differently, so a world where one voter votes in favor 
and the other votes against is less normal than a world where both vote in favor or both vote 
against. Taking $V_1$ and $V_2$ to be the variables describing how the voters vote, and $O$ to describe
the outcome, then with this normality ordering, neither $V_1 = 1$ nor $V_2 = 1$ is a cause of the
outcome according to any of the variants of the HP definition, although $V_1 = 1 \wedge V_2 = 1$ is a
cause. So, in particular, this means that causes are not necessarily single conjuncts according
to the original HP definition once it is extended to take normality into account. Moreover, if
the world where $V_1 = V_2 = 0$ is also taken to be abnormal, there is no cause at all of $O = 1$.
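
The claims in this example can be checked mechanically for one variant of the definition. The following is a minimal sketch, assuming the modified HP definition with the normality clause added (so the variables held fixed keep their actual values), a two-level ordering in which every world not flagged abnormal is strictly more normal than every flagged one, and an equation `O = min(V1, V2)` for the two-vote rule. The function names and the `abnormal` predicate are hypothetical, and the sketch does not cover the original or updated definitions.

```python
from itertools import combinations, product

ENDOGENOUS = ("V1", "V2", "O")


def solve(context, interventions):
    """World reached in the two-voter model after the given interventions."""
    w = {}
    w["V1"] = interventions.get("V1", context["V1"])
    w["V2"] = interventions.get("V2", context["V2"])
    w["O"] = interventions.get("O", min(w["V1"], w["V2"]))  # two votes needed
    return w


def at_least_as_normal(s, t, abnormal):
    # Two-level ordering: s fails to be at least as normal as t only when
    # s is abnormal and t is not.
    return (not abnormal(s)) or abnormal(t)


def satisfies_ac2(xs, context, phi, abnormal):
    """Is there a witness that falsifies phi and is at least as normal as actual?"""
    actual = solve(context, {})
    others = [v for v in ENDOGENOUS if v not in xs]
    for k in range(len(others) + 1):
        for ws in combinations(others, k):
            frozen = {v: actual[v] for v in ws}  # held at actual values
            for vals in product((0, 1), repeat=len(xs)):
                witness = solve(context, {**frozen, **dict(zip(xs, vals))})
                if not phi(witness) and at_least_as_normal(witness, actual, abnormal):
                    return True
    return False


def is_cause(xs, context, phi, abnormal):
    """Candidate cause: the variables in xs taking their actual values."""
    if not phi(solve(context, {})):
        return False
    if not satisfies_ac2(xs, context, phi, abnormal):
        return False
    # Minimality: no nonempty strict subset may already satisfy the test.
    for k in range(1, len(xs)):
        for sub in combinations(xs, k):
            if satisfies_ac2(sub, context, phi, abnormal):
                return False
    return True


context = {"V1": 1, "V2": 1}
wins = lambda w: w["O"] == 1
disagree = lambda w: w["V1"] != w["V2"]

print(is_cause(("V1",), context, wins, disagree))
print(is_cause(("V2",), context, wins, disagree))
print(is_cause(("V1", "V2"), context, wins, disagree))
```

Passing an `abnormal` predicate that also flags the world where both voters vote against makes every candidate fail, which matches the observation that the outcome then has no cause at all.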


The fact that there are no causes in this case may not seem so unreasonable. With this 
normality setting, the outcome is essentially being viewed as a foregone conclusion and thus 
not in need of a cause. But there are other cases where having no causes seems quite unrea- 
sonable. 


Example 3.2.4 Suppose that Billy is sick, Billy’s regular doctor treats Billy, and Billy re- 
covers the next day. In this case, we would surely want to say that the fact that Billy’s doctor 
treated Billy is a cause of Billy recovering the next day. Indeed, it is even a but-for cause of
Billy recovering. But if we assume that it is more normal for Billy’s regular doctor to treat 
Billy when he is sick than not to treat him, then taking normality into account, there is no 
cause of Billy recovering! It is easy to construct many other examples where there are no 
causes if the normal thing happens (a gardener watering her plants is not a cause of the plants 
surviving; if I normally drive home, then the fact that I drove home is not a cause of me being 
at home 20 minutes after I left work; and so on). This problem is dealt with by the graded causality
approach discussed in Section 3.3.


Example 3.2.5 Consider Example 2.9.1 again, where there is lightning and two arsonists. 
As we observed, the lightning is no longer a cause of the forest fire according to the mod- 
ified HP definition, since each arsonist by himself is a cause. But suppose that we take a 
world where $MD \neq MD'$ to be abnormal. This is consistent with the story, which says that
one arsonist dropping a match “does not happen under normal circumstances”; in particular,
$MD \neq MD'$ is inconsistent with the equations. In that case, neither $MD = 1$ nor $MD' = 1$ is
a cause of $FF = 1$ according to the modified HP definition in the extended causal model; neither
satisfies AC2*(a$^m$). Thus, $L = 1 \wedge MD = 1$ becomes a cause of $FF = 1$ according to
the modified HP definition, and $L = 1$ is again part of a cause of $FF = 1$.


Example 3.2.6 Consider the sophisticated model $M'_{RT}$ of the rock-throwing example, and
suppose that we declare the world where Billy throws a rock, Suzy doesn’t throw, and Billy
does not hit to be abnormal. This world was needed to show that Suzy throwing is a cause according
to the modified HP definition. Thus, with this normality ordering, $ST = 1$ is not a cause;
rather, $ST = 1 \wedge BT = 1$ is a cause, which seems inappropriate. However, $ST = 1$ remains
a cause according to the original and updated HP definition, since we can now take the witness
to be the world where $ST = 0$ and $BT = 0$; these variants allow the variables in $\vec{W}$ to take on
values other than their values in the actual world. Thus, $BT = 1$ is part of a cause according to
the modified HP definition but not a cause according to the original or updated HP definition.


Example 3.2.6 shows in part why the normality ordering is placed on worlds, which are 
assignments of values to the endogenous variables only. The witness world where BT = 1, 
BH = 0, and ST = 0 does not seem so abnormal, even if it is abnormal for Billy to throw 
and miss in a context where he is presumed accurate. 

Putting the normality ordering on worlds does not completely save the day, however. For 
one thing, it seems somewhat disconcerting that Billy’s throw should be part of the cause 
according to the modified HP definition in Example 3.2.6. But things get even worse. Suppose 
that we consider the richer rock-throwing model $M^*_{RT}$, which includes the variable BA,
for “Billy is an accurate rock thrower”. In $M^*_{RT}$, we have essentially moved the question of
Billy’s accuracy out of the context and made it an explicit endogenous variable, that is, it is
determined by the world. In the actual world, Billy is accurate; that is, $BA = 1$. (Note that
we may even have observations confirming this; we could see Billy’s rock passing right over
where the bottle was just after Suzy’s rock hit it and shattered it.) There is still no problem
here for the original and updated HP definition. If $s_{\vec{u}}$ is the actual world, then the world
$s_{BT=0,ST=0,\vec{u}}$ is arguably at least as normal as $s_{\vec{u}}$; although $BA = 1$ in $s_{BT=0,ST=0,\vec{u}}$, it is
not abnormal for Billy not to hit the bottle if he doesn’t throw. On the other hand, in the world
$s_{BH=0,ST=0,\vec{u}}$, which is needed to show that Suzy’s throw is a cause of the bottle shattering
according to the modified HP definition, we have $BT = 1$, $BA = 1$, and $BH = 0$: although
Billy throws and is accurate, he misses the bottle. This doesn’t seem like such a normal 
world. But if we declare this world to be abnormal, then Suzy’s throw is no longer a cause of 
the bottle shattering according to the modified HP definition. 

It gets even worse (at least according to some people’s judgments of causality) if we replace 
Billy the person by Billy the automatic rock-throwing machine, which is programmed to throw 
rocks. Again, it does not seem so bad if BA is not part of the language; programming errors 
or other malfunctions can still result in a rock-throwing machine missing. But with BA in 
the language (in which case we take BA = 1 to mean that the machine doesn’t malfunction 
and is programmed correctly), it does not seem so reasonable to take a world where BA = 1, 
BT = 1, and BH = 0 to be at least as normal as the actual world.

Does this represent a serious problem for the modified HP definition (when combined with 
normality)? I would argue that, although at first it may seem reasonable (and is what was 
done in the literature), the real source of the problem is putting normality on worlds. I return 
to this point in Section 3.5, where I consider an alternative approach to defining normality. 
For now I continue with putting normality on worlds and consider an approach that deals with 
the problem in the rock-throwing example in a different way, which has the further advantage 
of providing a more general approach to combining normality with causality. 


3.3 Graded Causation


AC2*(a) and AC2*(a$^m$), as stated, are all-or-nothing conditions; either it is legitimate to
consider a particular world (the one that characterizes the witness) or it is not. However, in 
some cases, to show that A is a cause of B, we might need to consider an unlikely intervention; 
it is the only one available. This is exactly what happened in the case of the rock-throwing 
machine. 

As observed in Examples 3.2.3 and 3.2.4, with certain normality orderings, some events 
can have no cause at all (although they would have a cause if normality constraints were ig- 
nored). We can avoid this problem by using normality to rank actual causes, rather than elim- 
inating them altogether. Doing so lets us explain the responses that people make to queries 
regarding actual causation. For example, while counterfactual approaches to causation usually
yield multiple causes of an outcome $\varphi$, people typically mention only one of them when
asked for a cause. One explanation of this is that they are picking the best cause, where best 
is judged in terms of normality. 

To make this precise, say that world $s$ is a witness world (or just witness) for $\vec{X} = \vec{x}$ being
a cause of $\varphi$ in context $\vec{u}$ if for some choice of $\vec{W}$, $\vec{w}$, and $\vec{x}'$ for which AC2(a) and AC2(b$^u$)
hold (AC2(a) and AC2(b$^o$) if we are using the original HP definition; AC2(a$^m$) if we are
using the modified HP definition), $s$ is the assignment of values to the endogenous variables
that results from setting $\vec{X} = \vec{x}'$ and $\vec{W} = \vec{w}$ in context $\vec{u}$. In other words, a witness $s$ is a
world that demonstrates that AC2(a) holds. (Calling a world $s$ a witness world is consistent
with calling the triple $(\vec{W}, \vec{w}, \vec{x}')$ a witness, as I did in Section 2.2. The triple determines
a unique world, which is a witness world. I continue to use the word “witness” for both a
witness world and a triple. I hope that the context will make clear which is meant.)

In general, there may be many witness worlds for $\vec{X} = \vec{x}$ being a cause of $\varphi$ in $(M, \vec{u})$.
Say that $s$ is a best witness for $\vec{X} = \vec{x}$ being a cause of $\varphi$ if there is no other witness $s'$ such
that $s' \succ s$. (Note that there may be more than one best witness.) We can then grade candidate
causes according to the normality of their best witnesses. We would expect that someone’s
willingness to judge $\vec{X} = \vec{x}$ an actual cause of $\varphi$ in $(M, \vec{u})$ increases as a function of the
normality of the best witness world for $\vec{X} = \vec{x}$ in comparison to the best witness for other
candidate causes. Thus, we are less inclined to judge that $\vec{X} = \vec{x}$ is an actual cause of $\varphi$ when
there are other candidate causes with more normal witnesses.
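
Selecting best witnesses amounts to keeping the witnesses that no other witness strictly beats. A minimal sketch follows; it assumes the witnesses of a candidate cause have already been found, represents each witness only by the set of atypical facts true in it, and again uses the subset ordering as a stand-in partial preorder. The name `best_witnesses` and the fact labels are hypothetical.

```python
def best_witnesses(witnesses, at_least_as_normal):
    """Witnesses that no other witness is strictly more normal than."""

    def strictly_more_normal(s, t):
        return at_least_as_normal(s, t) and not at_least_as_normal(t, s)

    return [
        s for s in witnesses
        if not any(strictly_more_normal(t, s) for t in witnesses)
    ]


# Assumed ordering for the illustration: fewer atypical facts (by set
# inclusion) means at least as normal.
def subset_order(s, t):
    return s <= t


witnesses = [
    frozenset({"a"}),
    frozenset({"b"}),
    frozenset({"a", "c"}),
]

# The first two witnesses are incomparable, so both are best; the third
# is strictly less normal than the first and is dropped.
print(best_witnesses(witnesses, subset_order))
```

Because the ordering is only partial, the result is a list rather than a single world, which reflects the remark that there may be more than one best witness.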

Note that once we allow graded causality, we use AC2(a) (resp., AC2(a$^m$)) rather than
AC2*(a) (resp., AC2*(a$^m$)); normality is used only to characterize causes as “good” or
“poor”. In Example 3.2.2, each doctor other than Billy’s doctor not treating Billy is still a 
cause of Billy being sick on Tuesday, but is such a poor cause that most people are inclined 
to ignore it if there is a better cause available. On the other hand, in Example 3.2.4, Billy’s 
doctor treating Billy is a cause of Billy recovering the next day, and is the only cause, so it is 
what we take as “the” cause. In general, an event whose only causes are poor causes in this 
sense is one that we expected all along, so doesn’t require an explanation (see Example 3.4.1). 
However, if we are looking for a cause, then even a poor cause is better than nothing. 

Before going on, I should note that it has been common in the philosophy literature on 
causation to argue that, although people do seem to grade causality, it is a mistake to do so 
(see the notes for more discussion on this point). As I said, one of my goals is to get a notion
of causality that matches how people use the word and is useful. Since it is clear that people 
make these distinctions, I think it is important to get a reasonable definition of causality that 
allows the distinctions to be made. Moreover, I think that people make these distinctions 
because they are useful ones. Part of why we want to ascribe causality is to understand how 
to solve problems. We might well want to understand why Billy’s doctor did not treat Billy. 
We would not feel the need to understand why other doctors did not treat Billy. 


3.4 More Examples 


In this section, I give a few more examples showing the power of adding normality to the 
definition of causality and the advantages of thinking in terms of graded causality. 


3.4.1 Knobe effects 


Recall the vignette from Knobe and Fraser’s experiment that was discussed in Section 3.1, 
where Professor Smith and the administrative assistant both take pens from the receptionist’s 
desk. After reading this vignette, subjects were randomly presented with one of the follow- 
ing propositions and asked to rank their agreement on a 7-point scale from -3 (completely 
disagree) to +3 (completely agree): 


Professor Smith caused the problem. 
The administrative assistant caused the problem. 

Subjects gave an average rating of 2.2 to the first claim, indicating agreement, but -1.2 to
the second claim, indicating disagreement. Thus, subjects are judging the two claims differ- 
ently, due to the different normative status of the two actions. (Note that subjects were only 
presented with one of these claims: they were not given a forced choice between the two.) 


In another experiment, subjects were presented with a similar vignette, but this time both 
professors and administrative assistants were allowed to take pens. In this case, subjects tend 
to give intermediate values. That is, when the vignette is changed so that Professor Smith is 
not violating a norm when he takes the pen, not only are subjects less inclined to judge that 
Professor Smith caused the problem, but they are more inclined to judge that the administra- 
tive assistant caused the problem. The most plausible interpretation of these judgments is that 
subjects’ increased willingness to say that the administrative assistant caused the problem is 
a direct result of their decreased willingness to say that Professor Smith caused the problem. 
This suggests that attributions of actual causation are at least partly a comparative affair. 


The obvious causal model of the original vignette has three endogenous variables: 


• PT = 1 if Professor Smith takes the pen, 0 if she does not;

• AT = 1 if the administrative assistant takes the pen, 0 if she does not;

• PO = 1 if the receptionist is unable to take a message, 0 if no problem occurs.


There is one equation: PO = min(PT, AT) (equivalently, $PO = PT \wedge AT$). The exogenous
variables are such that PT and AT are both 1. Therefore, in the actual world, we have
PT = 1, AT = 1, and PO = 1.

Both PT = 1 and AT = 1 are but-for causes of PO = 1 (and so causes according to 
all three variants of the definition). The best witness for PT = 1 being a cause is the world 
(PT = 0, AT = 1, PO = 0); the best witness for AT = 1 being a cause is (PT = 1, AT =
0, PO = 0). (Here and elsewhere, I describe a world by a tuple consisting of the values 
of the endogenous variables in that world.) The original story suggests that the witness for 
PT = 1 being a cause is more normal than the witness for AT = 1 by saying “administrative 
assistants are allowed to take pens, but professors are supposed to buy their own”. According 
to the definition above, this means that people are more likely to judge that PT = 1 is an
actual cause. However, if the vignette does not specify that one of the actions violates a 
norm, we would expect the relative normality of the two witnesses to be much closer, which 
is reflected in how subjects actually rated the causes. 
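
The but-for reasoning in the pen vignette can be checked mechanically. The following Python sketch is illustrative only: the function names and the `NORM_VIOLATED_BY` table are assumptions introduced here, with the table encoding the stated rule that professors, but not administrative assistants, are supposed to buy their own pens.

```python
# Illustrative sketch of the pen vignette; all names are hypothetical.
ACTUAL = {"PT": 1, "AT": 1}

# Assumption: taking a pen violates a norm only for the professor.
NORM_VIOLATED_BY = {"PT": 1, "AT": 0}


def problem_occurs(world):
    """Structural equation PO = min(PT, AT)."""
    return min(world["PT"], world["AT"])


def but_for_witness(variable):
    """Set `variable` to 0 and keep the other variable at its actual value."""
    world = dict(ACTUAL)
    world[variable] = 0
    world["PO"] = problem_occurs(world)
    return world


def norm_violations(world):
    """Count norm-violating actions taken in a world (fewer means more normal)."""
    return sum(NORM_VIOLATED_BY[v] for v in ("PT", "AT") if world[v] == 1)


witness_pt = but_for_witness("PT")  # (PT = 0, AT = 1, PO = 0)
witness_at = but_for_witness("AT")  # (PT = 1, AT = 0, PO = 0)

# Both are but-for causes: PO drops to 0 in each witness world.
print(witness_pt["PO"], witness_at["PO"])
# The witness for PT = 1 contains no norm violation, so it is the more normal one.
print(norm_violations(witness_pt) < norm_violations(witness_at))
```

If the vignette is changed so that professors may also take pens, the table becomes all zeros and the two witnesses are equally normal under this crude count, which matches the intermediate ratings described above.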


This example, I believe, illustrates the power of allowing a “graded” approach to causality. 
As I said above, there have been claims that it is a mistake to focus on one cause and ignore 
others (which is an extreme of the graded approach that occurs often in practice). It seems 
to me that the “egalitarian” notion of cause is entirely appropriate at the level of causal struc- 
ture, as represented by the equations of a causal model. These equations represent arguably 
objective features of the world and are not sensitive to factors such as contextual salience and 
human considerations of normality. Pragmatic, subjective factors then determine which actual 
causes we select and label “the” causes (or, at least, as “better” causes). 

3.4.2 Bogus prevention


The initial impetus for work on adding normality considerations to basic causal models was 
provided by examples like the following “bogus prevention” example.


Example 3.4.1 Assassin is in possession of a lethal poison but has a last-minute change of 
heart and refrains from putting it in Victim’s coffee. Bodyguard puts antidote in the coffee, 
which would have neutralized the poison had there been any. Victim drinks the coffee and 
survives. Is Bodyguard’s putting in the antidote a cause of Victim surviving? Most people 
would say no, but according to the original and updated HP definition, it is. For in the contin- 
gency where Assassin puts in the poison, Victim survives iff Bodyguard puts in the antidote. 
According to the modified HP definition, Bodyguard putting in the antidote is not a cause, but 
is part of a cause (along with Assassin not putting in the poison). Although the latter position 
is perhaps not completely unreasonable (after all, if Bodyguard didn’t put in the antidote and 
Assassin did put in the poison, Victim would have died), most people would not take either 
Bodyguard or Assassin to be causes. Roughly speaking, they would reason that there is no 
need to find a cause for something that was expected all along. ■


What seems to make Example 3.4.1 particularly problematic is that, if we use the obvious 
variables to describe the problem, the resulting structural equations are isomorphic to those 
in the disjunctive version of the forest-fire example where either the lit match or the lightning 
suffices to start the fire. Specifically, take the endogenous variables in Example 3.4.1 to be 
A (for “assassin does not put in poison”), B (for “bodyguard puts in antidote”), and VS (for
“victim survives”). That is,


• A = 1 if Assassin does not put in the poison, 0 if he does;
• B = 1 if Bodyguard adds the antidote, 0 if he does not;
• VS = 1 if Victim survives, 0 if he does not.


Then A, B, and VS satisfy exactly the same equations as L, MD, and FF, respectively: A 
and B are determined by the environment, and VS = A ∨ B. In the context where there is
lightning and the arsonist drops a lit match, both the lightning and the match are causes of the 
forest fire according to the original and updated HP definition (and parts of causes according 
to the modified HP definition), which seems reasonable. For similar reasons, in this model, 
the original and updated HP definition declare both A = 1 and B = 1 to be causes of VS = 1 
(and the modified HP definition declares them to be parts of causes). But here it does not 
seem reasonable that Bodyguard’s putting in the antidote is a cause or even part of a cause. 
Nevertheless, it seems that any definition that just depends on the structural equations is bound 
to give the same answers in these two examples. 

Using normality gives us a straightforward way of dealing with the problem. In the actual 
world, A = 1, B = 1, and VS = 1. The witness world for B = 1 being a cause of
VS = 1 is the world where A = 0, B = 0, and VS = 0. If we make the assumption that A is
typically 1 and that B is typically 0, this leads to a normality ordering in which the two worlds
(A = 1, B = 1, VS = 1) and (A = 0, B = 0, VS = 0) are incomparable; although both are
abnormal, they are abnormal in incomparable ways. Since the unique witness for B = 1 to
be an actual cause of VS = 1 is incomparable with the actual world, using AC2*(a), we get
that B = 1 is not an actual cause of VS = 1.
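
To see why the two worlds are incomparable, here is a small Python sketch. It assumes one particular way of turning typical values into a partial preorder: a world is at least as normal as another if its set of atypically valued variables is contained in the other's. That construction, and every name in the code, is an assumption made for illustration.

```python
# Illustrative sketch; the subset-based ordering is an assumption.
TYPICAL = {"A": 1, "B": 0}  # A is typically 1, B is typically 0


def atypical(world):
    """Variables whose value in `world` differs from the typical value."""
    return frozenset(v for v, t in TYPICAL.items() if world[v] != t)


def at_least_as_normal(w1, w2):
    """w1 is at least as normal as w2 if it is atypical in no additional way."""
    return atypical(w1) <= atypical(w2)


actual = {"A": 1, "B": 1, "VS": 1}   # atypical only in B
witness = {"A": 0, "B": 0, "VS": 0}  # atypical only in A

comparable = at_least_as_normal(witness, actual) or at_least_as_normal(actual, witness)
print(sorted(atypical(actual)), sorted(atypical(witness)), comparable)
```

Neither set of atypical variables contains the other, so the witness is not at least as normal as the actual world, and the normality condition fails.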

Note that if we require AC2*(a), A = 1 is not a cause of VS = 1 either; the world 
(A = 0, B = 0, VS = 0) is also the unique witness world for A = 1 being a cause of
VS = 1. That means VS = 1 has no causes. Again, in this case, we would typically not 
look for an explanation or a cause of VS = 1 because it was expected all along. However, if 
we are looking for a cause, using graded causality, both A = 1 and B = 1 count as causes 
of VS = 1 according to the original and updated HP definition (extended to take normality 
into account, of course), although they get a “poor” grade; with the modified HP definition, 
the conjunction A = 1 ∧ B = 1 counts as a cause. Again, we are inclined to accept a “poor”
cause as a cause if there are no better alternatives available. 

Now suppose that we consider a world where assassination is typical; assassins are hiding 
behind every corner. Would we then want to take worlds where A = 0 to be more normal than 
worlds where A = 1, all else being fixed? Indeed, even if assassinations were not common- 
place, isn’t it typical for assassins to poison drinks? That is what assassins do, after all. Here 
we are dealing with two different interpretations of “normality”. Even if assassinations occur 
frequently, they are morally wrong, and hence abnormal in the moral sense. Still, perhaps we 
would be more willing to accept the assassin not putting in poison as a cause of the victim 
surviving in that case; we certainly wouldn’t take VS = 1 to be an event that no longer re- 
quires explanation. Indeed, imagine the assassin master sent his crack assassin out to murder 
the victim and then discovers that the victim is still alive. Now he would want an explana- 
tion, and the explanation that the assassin did not actually put in the poison now seems more 
reasonable. 

Although (despite the caveats just raised) using normality seems to deal with the assassin 
problem, there is another, arguably better, way of modeling this example that avoids issues of 
normality altogether. The idea is similar in spirit to how the rock-throwing problem was
handled. We add a variable PN to the model, representing whether a chemical reaction takes
place in which poison is neutralized. Of course PN = 1 exactly if the assassin puts in the
poison and Bodyguard adds the antidote. Thus, the model has the following two equations: 


• PN = ¬A ∧ B (i.e., (1 − A) × B) and
• VS = max(A, PN) (i.e., A ∨ PN).


Thus, in the actual world, where A = 1 and B = 1, PN = 0. The poison is not neutralized
because there was never poison there in the first place. With the addition of PN, B = 1 is no
longer a cause of VS = 1 according to the original or updated HP definitions, without taking
normality into account. Clearly, if B = 1 were to be a cause, the set W would have to include
A, and we would have to set A = 0. The question is whether PN is in Z or W. Either way,
AC2(b^o) would not hold. For if PN ∈ Z, then since its original value is 0, if A is set to 0 and
PN is set to its original value of 0, then VS = 0 even if B = 1. In contrast, if PN ∈ W, we
must set PN = 0 for AC2(a) to hold, and with this setting, AC2(b^o) again does not hold.
However, A = 1 is a cause of VS = 1, according to all variants of the HP definition. (If we
fix PN at its actual value of 0 and set A = 0, then VS = 0.) It follows that B = 1 is not even
part of a cause of VS = 1 according to the modified HP definition. (This also follows from
Theorem 2.2.3.) Although it seems reasonable that B = 1 should not be a cause of VS = 1,
is it reasonable that A = 1 should be a cause? Anecdotally, most people seem to find that less
objectionable than B = 1 being a cause. Normality considerations help explain why people
are not so enthusiastic about calling A = 1 a cause either, although it is a “better” cause than
B = 1.
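
The effect of adding PN can be explored with a few lines of Python. This is only a sketch of the interventions discussed above, not a full implementation of any variant of the HP definition; the `solve` helper and its optional `pn` argument (which holds PN fixed by intervention) are assumptions introduced for illustration.

```python
# Illustrative sketch; `solve` is a hypothetical helper, not part of the text.
def solve(a, b, pn=None):
    """Evaluate the model with A and B set directly.

    If `pn` is given, PN is held at that value by intervention;
    otherwise PN follows its equation PN = (1 - A) * B.
    """
    if pn is None:
        pn = (1 - a) * b
    vs = max(a, pn)  # VS = max(A, PN)
    return {"A": a, "B": b, "PN": pn, "VS": vs}


actual = solve(a=1, b=1)  # PN = 0 and VS = 1 in the actual world

# Contingency A = 0 with PN left free: VS now depends on B.
vs_when_pn_free = [solve(0, b)["VS"] for b in (0, 1)]

# Same contingency with PN held at its actual value 0: B no longer matters.
vs_when_pn_frozen = [solve(0, b, pn=actual["PN"])["VS"] for b in (0, 1)]

print(actual, vs_when_pn_free, vs_when_pn_frozen)
```

With PN frozen at its actual value, Victim dies whenever A = 0, whatever Bodyguard does. That is the step that blocks B = 1 from being a cause, and it is also the check showing that A = 1 is a cause.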

Interestingly, historically, this example was the motivation for introducing normality con- 
siderations. Although normality is perhaps not so necessary to deal with it, it does prove 
useful in a small variant of this example. 


Example 3.4.2 Again, suppose that Bodyguard puts an antidote in Victim’s coffee, but now 
Assassin puts the poison in the coffee. However, Assassin would not have put the poison in 
the coffee if Bodyguard hadn’t put the antidote in. (Perhaps Assassin is putting in the poison 
only to make Bodyguard look good.) Formally, we have the equation A = ¬B. Now Victim
drinks the coffee and survives. 

Is Bodyguard putting in the antidote a cause of Victim surviving? It is easy to see that, 
according to all three variants of the HP definition, it is. If we fix A = 0 (its value in the 
actual world), then Victim survives if and only if Bodyguard puts in the antidote. Intuition 
suggests that this is unreasonable. By putting in the antidote, Bodyguard neutralizes the effect 
of the other causal path he sets in action: Assassin putting in the poison. 

Here normality considerations prove quite useful. The world where Bodyguard doesn’t put 
in the antidote but Assassin puts in the poison anyway (i.e., A = B = 0) directly contradicts 
the story. Given the story, this world should be viewed as being less normal than the actual 
world, where A = 0 and B = 1. Thus, B = 1 is not a cause of VS = 1 in the extended model,
as we would expect. 


As the following example shows, normality considerations do not solve all problems. 


Example 3.4.3 Consider the train in Example 2.3.6 that can go on different tracks to reach 
the station. As we observed, if we model the story using only the variables F (for “flip”)
and T (for “track”), then flipping the switch is not a cause of the train arriving at the station.
But if we add variables LB and RB (for left-hand track blocked and right-hand track blocked,
with LB = RB = 0 in the actual context), then it is, according to the original and updated
HP definition, although not according to the modified HP definition. Most people would not 
consider the flip a cause of the train arriving. We can get this outcome even with the original 
and updated definition if we take normality into account. Specifically, suppose that the train 
was originally going to take the left track and took the right track as a result of the switch. 
To show that the switch is a cause, consider the witness world where LB = 1 (and RB = 0). 
If we assume, quite reasonably, that this world is less normal than the actual world, where 
LB = RB = 0, then flipping the switch no longer counts as a cause, even with the original and 
updated HP definition. 

But normality doesn’t quite save the day here. Suppose that we consider the context where 
both tracks are blocked. In this case, the original and updated HP definitions declare the flip 
a cause of the train not arriving (by considering the contingency where LB = 0). Normality 
considerations don’t help; the contingency where LB = 0 is more normal than the actual 
situation, where the track is blocked. The modified definition again matches intuitions better; 
it does not declare flipping the switch a cause. 

As pointed out in the notes to Chapter 2, there is yet another way of modeling the story.
Instead of the variables LB and RB, we use the variables LT (“train went on the left track”)
and RT (“train went on the right track”). In the actual world, F = 1, RT = 1, LT = 0, and
A = 1. Now F = 1 is a cause of A = 1, according to all the variants of the HP definition. If
we simply fix LT = 0 and set F = 0, then A = 0. But in this case, normality conditions do
apply: the world where the train does not go on the left track despite the switch being set to
the left is less normal than the actual world. ■


3.4.3 Voting examples 


Voting can lead to some apparently unreasonable causal outcomes (at least, if we model things 
naively). Consider Jack and Jill, who live in an overwhelmingly Republican district. As 
expected, the Republican candidate wins with an overwhelming majority. Jill would normally 
have voted Democrat but did not vote because she was disgusted by the process. Jack would 
normally have voted Republican but did not vote because he (correctly) assumed that his 
vote would not affect the outcome. In the naive model, both Jack and Jill are causes of the 
Republican victory according to all the variants of the HP definition. For if enough of the 
people who voted Republican had switched to voting Democrat or abstaining, then if Jack (or 
Jill) had voted Democrat, the Democrat would have won or it would have been a tie, whereas 
the Republican would have won had they abstained. 

We can do better by constructing a model that takes these preferences into account. One 
way to do so is to assume that their preferences are so strong that we may as well take them for 
granted. Thus, the preferences become exogenous; the only endogenous variables are whether 
they vote (and the outcome); if they vote at all, they vote according to their preferences. In 
this case, Jack’s not voting is not a cause of the outcome, but Jill’s not voting is. 

More generally, with this approach, the abstention of a voter whose preference is made ex- 
ogenous and is a strong supporter of the victor does not count as a cause of victory. This does 
not seem so unreasonable. After all, in an analysis of a close political victory in Congress, 
when an analyst talks about the cause(s) of victory, she points to the swing voters who voted 
one way or the other, not the voters who were taken to be staunch supporters of one particular 
side. 

That said, making a variable exogenous seems like a somewhat draconian solution to the 
problem. It also does not allow us to take into account smaller gradations in depth of feeling. 
At what point should a preference switch from being endogenous to exogenous? We can 
achieve the same effect in an arguably more natural way by using normality considerations. 
In the case of Jack and Jill, we can take voting for a Democrat to be highly abnormal for Jack 
and voting for a Republican to be highly abnormal for Jill. To show that Jack (resp., Jill) 
abstaining is a cause of the victory, we need to consider a contingency where Jack (resp., Jill) 
votes for the Democratic candidate. This would be a change to a highly abnormal world in 
the case of Jack but to a more normal world in the case of Jill. Thus, if we use normality as a 
criterion for determining causality, Jill would count as a cause, but Jack would not. If we use 
normality as a way of grading causes, Jack and Jill would still both count as causes for the 
victory, but Jill would be a much better cause. More generally, the more normal it would be 
for someone who abstains to vote Democrat, the better a cause that voter would be. The use 
of normality here allows for a more nuanced gradation of cause than the rather blunt approach
of making a variable either exogenous or endogenous.

Now consider a vote where everyone can vote for one of three candidates. Suppose that the 
actual vote is 17–2–0 (i.e., 17 vote for candidate A, 2 for candidate B, and none for candidate
C). Then not only is every vote for candidate A a cause of A winning, every vote for B is
also a cause of A winning according to the original and updated HP definition. To see this,
consider a contingency where 8 of the voters for A switch to C. Then if one of the voters for
B votes for C, the result is a tie; if that voter switches back to B, then A wins (even if some
subset of the voters who switch from A to C switch back to A). According to the modified
HP definition, each voter for B is part of a cause of A’s victory. The argument is essentially
the same: had a voter for B switched to C along with eight of the voters for A, it would have
been a tie, so A would not have won. 
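
The arithmetic behind this contingency is easy to verify. The following Python sketch is illustrative; the `winner` and `election` helpers are hypothetical names, and plurality voting with an explicit tie result is an assumption about how the outcome is determined.

```python
# Illustrative check of the 17-2-0 contingency; helper names are hypothetical.
from collections import Counter


def winner(votes):
    """Return the plurality winner, or None if the top two candidates tie."""
    ranked = Counter(votes).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def election(a_to_c, b_to_c):
    """Start from the actual 17-2-0 vote and move some A and B voters to C."""
    return ["A"] * (17 - a_to_c) + ["B"] * (2 - b_to_c) + ["C"] * (a_to_c + b_to_c)


actual = winner(election(0, 0))       # A wins 17-2-0
tie = winner(election(8, 1))          # 9-1-9: A and C tie
# With the B voter back at B, A wins however many of the 8 defectors return.
restored = [winner(election(k, 0)) for k in range(9)]

print(actual, tie, restored)
```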

Is this reasonable? What makes it seem particularly unreasonable is that if it had just 
been a contest between A and B, with the vote 17–2, then the voters for B would not have
been causes of A winning. Why should adding a third option make a difference? Arguably,
the modified HP definition makes it clear why adding the third option makes a difference: the
third candidate could win under the appropriate circumstances, and those circumstances
include the voters for B changing their votes. 

Indeed, it is clear that adding a third option can well make a difference in some cases. For 
example, we speak of Nader costing Gore a victory over Bush in the 2000 election. But we 
don’t speak of Gore costing Nader a victory, although in a naive HP model of the situation, 
all the voters for Gore are causes of Nader not winning as much as the voters for Nader are 
causes of Gore not winning. The discussion above points to a way out of this dilemma. If a 
sufficiently large proportion of Bush and Gore voters are taken to be such strong supporters 
that they will never change their minds, and we make their votes exogenous, then it is still the 
case that Nader caused Gore to lose (assuming that most Nader voters would have voted for 
Gore if Nader hadn’t run), but not the case that Gore caused Nader to lose. Similar considerations
apply in the case of the 17–2 vote. And again, we can use normality considerations
to give arguably more natural models of these examples. As we shall see in Chapter 6, using
ideas of blame and responsibility gives yet another approach to dealing with these concerns.


3.4.4 Causal chains 


As I argued in Section 2.4, causation seems transitive when there is a simple causal chain 
involving but-for causality, and the final effect counterfactually depends on the initial cause. 
By contrast, the law does not assign causal responsibility for sufficiently remote consequences 
of an action. For example, in Regina v. Faulkner, a well-known Irish case, a lit match aboard a 
ship caused a cask of rum to ignite, causing the ship to burn, which resulted in a large financial 
loss by Lloyd’s insurance, leading to the suicide of a financially ruined insurance executive. 
The executive’s widow sued for compensation, and it was ruled that the negligent lighting of 
the match was not a cause (in the legally relevant sense) of his death. By taking normality 
into account, we can make sense of this kind of attenuation. 

We can represent the case of Regina v. Faulkner using a causal model with nine variables: 


• M = 1 if the match is lit, 0 if it is not;
• R = 1 if there is rum in the vicinity of the match, 0 if not;
• RI = 1 if the rum ignites, 0 if it does not;
• F = 1 if there is further flammable material near the rum, 0 if not;
• SD = 1 if the ship is destroyed, 0 if not;
• LI = 1 if the ship is insured by Lloyd’s, 0 if not;
• LL = 1 if Lloyd’s suffers a loss, 0 if not;
• EU = 1 if the insurance executive was mentally unstable, 0 if not; and
• ES = 1 if the executive commits suicide, 0 if not.

There are four structural equations:

• RI = min(M, R);
• SD = min(RI, F);
• LL = min(SD, LI); and
• ES = min(LL, EU).


This model is shown graphically in Figure 3.1. The exogenous variables are such that M,
R, F, LI, and EU are all 1, so in the actual world, all variables take the value 1. Intuitively,
the events M = 1, RI = 1, SD = 1, LL = 1, and ES = 1 form a causal chain. The first
four events are but-for causes of ES = 1, and so are causes according to all variants of the 
HP definition. 


[Diagram: M → RI → SD → LL → ES]

Figure 3.1: Attenuation in a causal chain.


Let us now assume that, for the variables M, R, F, LI, and EU, 0 is the typical value, and
1 is the atypical value. Consider a very simple normality ordering, which is such that worlds
where more of these variables take the value 0 are more normal. For simplicity, consider just
the first and last links in the chain of causal reasoning: M = 1 and LL = 1, respectively. The
world w = (M = 0, R = 1, RI = 0, F = 1, SD = 0, LI = 1, LL = 0, EU = 1, ES = 0)
is a witness of M = 1 being a cause of ES = 1 (for any of the variants of the HP definition).
This is quite an abnormal world, although more normal than the actual world, so M = 1
does count as an actual cause of ES = 1, even using AC2*(a). However, note that if any
of the variables R, F, LI, or EU is set to 0, then we no longer have a witness. Intuitively,
ES counterfactually depends on M only when all of these other variables take the value 1.

Now consider the event LL = 1. The world w’ = (M = 0, R = 0, RI = 0, F = 0,
SD = 0, LI = 0, LL = 0, EU = 1, ES = 0) is a witness for LL = 1 being an actual cause
of ES = 1 according to the original and updated HP definition. World w’ is significantly
more normal than the best witness for M = 1 being a cause. Intuitively, LL = 1 needs fewer
atypical conditions to be present in order to generate the outcome ES = 1. It requires only
the instability of the executive, but not the presence of rum, other flammable materials, and so
on. So although both LL = 1 and M = 1 count as causes using AC2*(a), LL = 1 is a better
cause, so we would expect that people would be more strongly inclined to judge that LL = 1
is an actual cause of ES = 1 than M = 1.
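
The comparison of the two witnesses can be made concrete by counting atypical values, which is exactly the simple normality ordering assumed above. In the Python sketch below, the `world` and `atypicality` helpers are hypothetical names; the five root variables are set directly and the remaining variables follow the four structural equations.

```python
# Illustrative sketch of the Regina v. Faulkner model; helper names are hypothetical.
ROOTS = ("M", "R", "F", "LI", "EU")  # variables whose typical value is 0


def world(m, r, f, li, eu):
    """Compute RI, SD, LL, and ES from the four structural equations."""
    ri = min(m, r)
    sd = min(ri, f)
    ll = min(sd, li)
    es = min(ll, eu)
    return {"M": m, "R": r, "RI": ri, "F": f, "SD": sd,
            "LI": li, "LL": ll, "EU": eu, "ES": es}


def atypicality(w):
    """Number of root variables taking the atypical value 1 (lower is more normal)."""
    return sum(w[v] for v in ROOTS)


actual = world(1, 1, 1, 1, 1)
w = world(0, 1, 1, 1, 1)        # witness for M = 1 being a cause of ES = 1
w_prime = world(0, 0, 0, 0, 1)  # witness for LL = 1 being a cause of ES = 1

print(atypicality(actual), atypicality(w), atypicality(w_prime))
```

Both witnesses are more normal than the actual world, but the witness for LL = 1 has far fewer atypical values than the witness for M = 1, which is the sense in which LL = 1 is the better cause.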

Once we take normality into account, we can also have a failure of transitivity. For 
example, suppose that we slightly change the normality ordering so that the world w = 
(M = 0, R = 1, RI = 0, F = 1, SD = 0, LI = 1, LL = 0, EU = 1, ES = 0) that is a witness
of M = 1 being a cause of ES = 1 is less normal than the actual world, but otherwise
leave the normality ordering untouched. Now M = 1 is no longer a cause of ES = 1 using
AC2*(a). However, causality still holds for each step in the chain: M = 1 is a cause of
RI = 1, RI = 1 is a cause of SD = 1, SD = 1 is a cause of LL = 1, and LL = 1 is a cause
of ES = 1. Thus, normality considerations can explain the lack of transitivity in long causal 
chains. 

Note that the world w’ is not a witness according to the modified HP definition since, in w’, 
the values of the variables R, F, and LI differ from their actual values. Normality does not
seem to help in this example if we use the modified HP definition. However, as we shall see 
in Section 6.2, we can deal with this example perfectly well using the modified HP definition 
if we use ideas of blame. 

It is worth noting that, with the original and updated HP definition, the extent to which we 
have attenuation of actual causation over a causal chain is not just a function of the number of 
links in the chain. It is, rather, a function of how abnormal the circumstances are that must be 
in place in order for the causal chain to run from start to finish. 


3.4.5 Legal doctrines of intervening causes 


In general, the law holds that one is not causally responsible for some outcome that occurred 
only due to either a later deliberate action by some other agent or some very improbable event. 
For example, if Anne negligently spills gasoline, and Bob carelessly throws a cigarette into 
the spilled gasoline, then Anne’s action is a cause of the fire. But if Bob maliciously throws 
a cigarette into the spilled gasoline, then Anne is not considered a cause. (This example is 
based on the facts of an actual case: Watson v. Kentucky and Indiana Bridge and Railroad.)
This kind of reasoning can also be modeled using an appropriate normality ordering, as I now 
show. 

To fully capture the legal concepts, we need to represent the mental states of the agents. 
We can do this with the following six variables: 


• AN = 1 if Anne is negligent, 0 if she isn’t;
• AS = 1 if Anne spills the gasoline, 0 if she doesn’t;
• BC = 1 if Bob is careless (i.e., doesn’t notice the gasoline), 0 if not;
• BM = 1 if Bob is malicious, 0 otherwise;
• BT = 1 if Bob throws a cigarette, 0 if he doesn’t; and
• F = 1 if there is a fire, 0 if there isn’t.

We have the following equations:

• F = min(AS, BT) (i.e., F = AS ∧ BT);
• AS = AN; and
• BT = max(BC, BM, 1 − AS) (i.e., BT = BC ∨ BM ∨ ¬AS).


This model is shown graphically in Figure 3.2. (Note that the model incorporates certain 
assumptions about what happens in the case where Bob is both malicious and careless and in 
the cases where Anne does not spill gasoline; what happens in these cases is not clear from 
the usual description of the example. These assumptions do not affect the analysis.) 


[Diagram: AN → AS; BC, BM, and AS → BT; AS and BT → F]

Figure 3.2: An intervening cause.


It seems reasonable to assume that BM, BC, and AN typically take the value 0. But more 
can be said. In the law, responsibility requires a mens rea—literally, a guilty mind. Mens rea 
comes in various degrees. Starting with an absence of mens rea and then in ascending order 
of culpability, these are: 


• prudent and reasonable: the defendant behaved as a reasonable person would;
• negligent: the defendant should have been aware of the risk of harm from his actions;
• reckless: the defendant acted in full knowledge of the harm her actions might cause;
• criminal/intentional: the defendant intended the harm that occurred.


We can represent this scale in terms of decreasing levels of typicality. Specifically, in decreas- 
ing order of typicality, we have 


1. BC = 1;
2. AN = 1;
3. BM = 1.

Thus, although “carelessness” is not on the scale, I have implicitly taken it to represent less
culpability than negligence. 

In all the examples up to now, I have not needed to compare worlds that differed in the value
of several variables in terms of normality. Here, when comparing the normality of worlds in
which two of these variables take the value 0 and one takes the value 1, we can capture legal
practice by taking the world with BC = 1 to be most normal, the world with AN = 1 to be 
next most normal, and the world with BM = 1 to be least normal. Consider first the case 
where Bob is careless. Then in the actual world we have 


(BM = 0, BC = 1, BT = 1, AN = 1, AS = 1, F = 1).


In the structural equations, F = 1 depends counterfactually on both BC = 1 and AN = 1.
Thus, these are both but-for causes, so are causes according to all the variants of the HP
definition. The best witness for AN = 1 being a cause is


(BM = 0, BC = 1, BT = 1, AN = 0, AS = 0, F = 0),


and the best witness for BC = 1 being a cause is


(BM = 0, BC = 0, BT = 0, AN = 1, AS = 1, F = 0).


Both of these worlds are more normal than the actual world. The first is more normal because
AN takes the value 0 instead of 1. The second world is more normal than the actual world
because BC takes the value 0 instead of 1. Hence, according to AC2*(a), both are actual
causes. However, the best witness for AN = 1 is more normal than the best witness for
BC = 1. The former witness has BC = 1 and AN = 0, and the latter witness has BC = 0
and AN = 1. Since AN = 1 is more atypical than BC = 1, the first witness is more normal.
This means that we are more inclined to judge Anne’s negligence a cause of the fire than 
Bob’s carelessness. 
Now consider the case where Bob is malicious. The actual world is 


(BM = 1, BC = 0, BT = 1, AN = 1, AS = 1, F = 1).


Again, without taking normality into account, both BM = 1 and AN = 1 are actual causes.
The best witness for AN = 1 being a cause is


(BM = 1, BC = 0, BT = 1, AN = 0, AS = 0, F = 0),


and the best witness for BM = 1 is


(BM = 0, BC = 0, BT = 0, AN = 1, AS = 1, F = 0).


However, now the best witness for AN = 1 is less normal than the best witness for BM = 1 
because BM = 1 is more atypical than AN = 1. So using this normality ordering, we 
are more inclined to judge that Bob’s malice is an actual cause of the fire than that Anne’s 
negligence is. 
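
The two cases can be reproduced with a short Python sketch. The structural equations are those given above. The numeric weights in `ATYPICALITY` are an assumption made purely for illustration: the text fixes only the order BC = 1, AN = 1, BM = 1 (from most to least typical), and summing weights is just one convenient way to extend that order to worlds.

```python
# Illustrative sketch; the weights and the summed score are assumptions.
ATYPICALITY = {"BC": 1, "AN": 2, "BM": 3}  # larger weight = less typical


def solve(an, bc, bm):
    """Evaluate AS = AN, BT = max(BC, BM, 1 - AS), F = min(AS, BT)."""
    as_ = an
    bt = max(bc, bm, 1 - as_)
    f = min(as_, bt)
    return {"BM": bm, "BC": bc, "BT": bt, "AN": an, "AS": as_, "F": f}


def score(world):
    """Total atypicality of a world (lower is more normal)."""
    return sum(weight for var, weight in ATYPICALITY.items() if world[var] == 1)


# Careless Bob: actual world has AN = 1, BC = 1, BM = 0.
careless_witness_an = solve(an=0, bc=1, bm=0)
careless_witness_bc = solve(an=1, bc=0, bm=0)

# Malicious Bob: actual world has AN = 1, BC = 0, BM = 1.
malicious_witness_an = solve(an=0, bc=0, bm=1)
malicious_witness_bm = solve(an=1, bc=0, bm=0)

# Careless case: Anne's witness is more normal, so Anne is the better cause.
print(score(careless_witness_an) < score(careless_witness_bc))
# Malicious case: Bob's witness is more normal, so Bob is the better cause.
print(score(malicious_witness_bm) < score(malicious_witness_an))
```

In all four witness worlds F = 0, confirming the but-for dependence; only the relative normality of the witnesses changes between the two cases.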

Recall that in the pen vignette of Section 3.4.1, judgments of the causal status of the administrative
assistant’s action changed, depending on the normative status of the professor’s action. Something
similar is happening here: the causal status of Anne’s action changes with the normative sta- 
tus of Bob’s action. This example also illustrates how context can play a role in determining 
what is normal and abnormal. In the legal context, there is a clear ranking of norm violations. 


3.5 An Alternative Approach to Incorporating Normality


I now present an alternative approach to incorporating normality concerns in causal models, 
inspired by the discussion after Example 3.2.6. As suggested earlier, this approach is also 
more compatible with the way that probability was incorporated into causal models. The 
first step is to put the normality ordering on causal settings. Since, for the purposes of this 
discussion, the causal model is fixed, this amounts to putting a normality ordering on contexts, 
rather than on worlds. This means that normality can be viewed as a qualitative generalization 
of probability. But, as I suggested before, to the extent that we are thinking probabilistically,
we should interpret u ≻ u′ as meaning “u is much more probable than u′”.


Putting the normality ordering on contexts clearly does not by itself solve the problem. 
Indeed, as noted in the discussion immediately after Example 3.2.6, the motivation for putting 
the normality ordering on worlds rather than contexts was to solve problems that resulted from 
putting the normality ordering on contexts! There is an additional step: lifting the normality
order on contexts to a normality ordering on sets of contexts. Again, this is an idea motivated 
by probability. We usually compare events, that is, sets of worlds, using probability, not just 
single worlds; I will also be comparing sets of contexts in the alternative approach to using 
normality. 


When using probability (ignoring measurability concerns), the probability of a set is just 
the sum of the probabilities of the elements in the set. But with normality, “sum” is in general 
undefined. Nevertheless, we can lift a partial preorder on an arbitrary set S to a partial preorder
on 2^S (the set of subsets of S) in a straightforward way. Given sets A and B of contexts, say
A ⪰^e B if, for all contexts u ∈ B, there exists a context u′ ∈ A such that u′ ⪰ u. (I use the
superscript e to distinguish the order ⪰^e on events, that is, sets of contexts, from the underlying
order ⪰ on contexts.) In words, this says that A is at least as normal as B if, for every context
in B, there is a context in A that is at least as normal.


Clearly we have u ⪰ u′ iff {u} ⪰^e {u′}, so ⪰^e extends ⪰ in the obvious sense. The
definition of ⪰^e has a particularly simple interpretation if ⪰ is a total preorder on contexts
and A and B are finite. In that case, there is a most normal context u_A in A (there may be
several equally normal contexts in A that are most normal; in that case, let u_A be one of them)
and a most normal context u_B in B. It is easy to see that A ⪰^e B iff u_A ⪰ u_B. That is, A
is at least as normal as B if the most normal worlds in A are at least as normal as the most
normal worlds in B. Going back to probability, this says that the qualitative probability of an
event is determined by its most probable element(s). For those familiar with ranking functions 
(see the notes at the end of the chapter), this ordering on events is exactly that determined by 
ranking functions. 
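
The lifting can be tried out on a small case. The following Python sketch is an illustration only and not part of the formal development: the four contexts u1 through u4 and their ranks are hypothetical, chosen to give a total preorder in which a lower rank means "more normal". It implements the lifted order directly from its definition and checks, over all pairs of nonempty sets of contexts, that it agrees with comparing most normal elements.

```python
from itertools import combinations


def lift(geq, A, B):
    """Return True iff A >=^e B: every context in B is matched by some
    context in A that is at least as normal according to geq."""
    return all(any(geq(a, b) for a in A) for b in B)


# Hypothetical total preorder on four contexts (an assumption made only
# for this sketch): a lower rank means "more normal".
rank = {"u1": 0, "u2": 1, "u3": 1, "u4": 2}


def geq(u, v):
    """u is at least as normal as v."""
    return rank[u] <= rank[v]


def nonempty_subsets(items):
    """Yield every nonempty subset of items as a set."""
    items = sorted(items)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield set(combo)


def agrees_with_most_normal():
    """Check that, for this total preorder, A >=^e B holds exactly when a
    most normal context of A is at least as normal as one of B."""
    for A in nonempty_subsets(rank):
        for B in nonempty_subsets(rank):
            best_a = min(A, key=rank.get)
            best_b = min(B, key=rank.get)
            if lift(geq, A, B) != geq(best_a, best_b):
                return False
    return True


print(lift(geq, {"u2", "u4"}, {"u3", "u4"}))  # u2 matches both u3 and u4
print(lift(geq, {"u2", "u4"}, {"u1"}))        # nothing in A matches u1
print(agrees_with_most_normal())
```

Because `lift` takes the underlying order as a parameter, the same function can be reused with a partial preorder; only the check in `agrees_with_most_normal` depends on the order being total.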


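Taking the causal model $M$ to be fixed, let $[\![\varphi]\!]$ denote the event corresponding to a
propositional formula $\varphi$; that is, $[\![\varphi]\!] = \{\vec{u} : (M, \vec{u}) \models \varphi\}$ is the set of contexts where $\varphi$ is true.
I now modify AC2$^+$(a) and AC2$^+$(a$^m$) so that instead of requiring that $s_{\vec{X} = \vec{x}', \vec{W} = \vec{w}, \vec{u}} \succeq s_{\vec{u}}$, I
require that $[\![\vec{X} = \vec{x}' \land \vec{W} = \vec{w} \land \neg\varphi]\!] \succeq^e [\![\vec{X} = \vec{x} \land \varphi]\!]$; that is, the set of worlds where
the witness $\vec{X} = \vec{x}' \land \vec{W} = \vec{w} \land \neg\varphi$ holds is at least as likely as the set of worlds satisfying
$\vec{X} = \vec{x} \land \varphi$.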

Doing this solves the problem in the rock-throwing example. For one thing, with this
change, the addition of BA (the accuracy of Billy’s throw) to the language in the
rock-throwing example has no impact. More generally, whether something like accuracy is
modeled by an endogenous variable or an exogenous variable has no impact on normality
considerations; in either case, we would still consider the same set of contexts when comparing the
normality of $[\![BH = 0 \land ST = 0 \land BS = 0]\!]$ and $[\![ST = 1 \land BS = 1]\!]$. It seems reasonable
to view $BH = 0 \land ST = 0 \land BS = 0$ as being relatively normal. Even if Billy typically
throws and typically is an accurate shot, he does occasionally not throw or throw and miss.
Thinking probabilistically, even if the probability of the event $[\![BH = 0 \land ST = 0 \land BS = 0]\!]$
is lower than that of $[\![ST = 1 \land BS = 1]\!]$, it is not sufficiently less probable to stop the
former from being as normal as the latter. (Here I am applying the intuition that $\succ$ represents
“much more probable than” rather than just “more probable than”.) This viewpoint makes 
sense even if Billy is replaced by a machine. Put another way, although it may be unlikely that 
the rock-throwing machine does not throw or throws and misses, it may not be viewed as all 
that abnormal. 

Going back to Example 3.2.6, it is still the case that Suzy’s throw by itself is not a cause
of the bottle shattering if $[\![BH = 0 \land ST = 0 \land BS = 0]\!]$ is not at least as normal as
$[\![ST = 1 \land BS = 1]\!]$ according to the modified HP definition (extended to take normality into
account). Similarly, Suzy’s throw is not a cause according to the original and updated HP
definition if $[\![BT = 0 \land ST = 0 \land BS = 0]\!]$ is not at least as normal as $[\![ST = 1 \land BS = 1]\!]$. Since
Billy not throwing is only one reason that Billy might not hit, $[\![BH = 0 \land ST = 0 \land BS = 0]\!]$
is at least as normal as $[\![BT = 0 \land ST = 0 \land BS = 0]\!]$. Thus, using the witness
$BT = 0 \land ST = 0$ will not make Suzy’s throw a cause using the original and updated
HP definition if it is not also a cause using the modified HP definition. Since we can always
take $BH = 0$ to be the witness for the original and updated HP definition, all the variants of
the HP definition will agree on whether $ST = 1$ is a cause of $BS = 1$.

Is it so unreasonable to declare both Billy and Suzy throwing a cause of the bottle shattering
if $[\![BH = 0 \land ST = 0 \land BS = 0]\!]$ is not at least as normal as $[\![ST = 1 \land BS = 1]\!]$? Roughly
speaking, this says that a situation where the bottle doesn’t shatter is extremely abnormal. 
Suzy’s throw is not the cause of the bottle shattering; it was going to happen anyway. The 
only way to get it not to happen is to intervene on both Suzy’s throw and Billy’s throw. 

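We can also apply these ideas to graded causality. Instead of comparing witness worlds,
I now compare witness events. But rather than just considering the best witnesses, I can
now consider all witnesses. That is, suppose that $(\vec{W}_1, \vec{w}_1, \vec{x}'_1), \ldots, (\vec{W}_k, \vec{w}_k, \vec{x}'_k)$ are the
witnesses to $\vec{X} = \vec{x}$ being a cause of $\varphi$ in $(M, \vec{u})$, and $(\vec{W}'_1, \vec{w}'_1, \vec{y}'_1), \ldots, (\vec{W}'_m, \vec{w}'_m, \vec{y}'_m)$ are
the witnesses to $\vec{Y} = \vec{y}$ being a cause of $\varphi$ in $(M, \vec{u})$. Now we can say that $\vec{X} = \vec{x}$ is at least
as good a cause of $\varphi$ in $(M, \vec{u})$ as $\vec{Y} = \vec{y}$ if

$$
\begin{aligned}
& [\![((\vec{X} = \vec{x}'_1 \land \vec{W}_1 = \vec{w}_1) \lor \cdots \lor (\vec{X} = \vec{x}'_k \land \vec{W}_k = \vec{w}_k)) \land \neg\varphi]\!] \\
& \quad \succeq^e [\![((\vec{Y} = \vec{y}'_1 \land \vec{W}'_1 = \vec{w}'_1) \lor \cdots \lor (\vec{Y} = \vec{y}'_m \land \vec{W}'_m = \vec{w}'_m)) \land \neg\varphi]\!].
\end{aligned}
\tag{3.1}
$$

That is, $\vec{X} = \vec{x}$ is at least as good a cause of $\varphi$ in $(M, \vec{u})$ as $\vec{Y} = \vec{y}$ if the disjunction of the
(formulas describing the) witness worlds for $\vec{X} = \vec{x}$ being a cause of $\varphi$ in $(M, \vec{u})$ is at least
as normal as the disjunction of the witness worlds for $\vec{Y} = \vec{y}$ being a cause of $\varphi$ in $(M, \vec{u})$.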

We can further extend this idea to parts of causes. (Recall that I earlier made the claim that
it is arguably better to redefine cause to be what I am now calling part of a cause.) Suppose
that $\vec{X}_1 = \vec{x}_1, \ldots, \vec{X}_k = \vec{x}_k$ are all the causes of $\varphi$ in $(M, \vec{u})$ that include $X = x$ as a
conjunct, with $(\vec{W}_1, \vec{w}_1, \vec{x}'_1), \ldots, (\vec{W}_k, \vec{w}_k, \vec{x}'_k)$ the corresponding witnesses. (It may be the
case that $\vec{X}_j = \vec{X}_{j'}$ and $\vec{x}_j = \vec{x}_{j'}$ for $j \ne j'$, although the corresponding witnesses are
different; a single cause may have a number of distinct witnesses.) Similarly, suppose that
$\vec{Y}_1 = \vec{y}_1, \ldots, \vec{Y}_m = \vec{y}_m$ are all the causes of $\varphi$ in $(M, \vec{u})$ that include $Y = y$ as a conjunct,
with $(\vec{W}'_1, \vec{w}'_1, \vec{y}'_1), \ldots, (\vec{W}'_m, \vec{w}'_m, \vec{y}'_m)$ the corresponding witnesses. Again, we can say that
$X = x$ is at least as good a part of a cause for $\varphi$ in $(M, \vec{u})$ as $Y = y$ if (3.1) holds.


The analysis for all the examples in Section 3.4 remains unchanged using the alternative 
approach, with the exception of the causal chain example in Section 3.4.4. Now the analysis 
for the original and updated HP definition becomes the same as for the modified HP definition. 
To determine the best cause, we need to compare the normality of $[\![M = 1 \land ES = 0]\!]$ to that
of $[\![LL = 1 \land ES = 0]\!]$. There is no reason to prefer one to the other; the length of the causal
chain from M to ES plays no role in this consideration. As I said, this example can be dealt 
with by considering blame, without appealing to normality (see Section 6.2). 


This modified definition of graded causality, using (3.1), has some advantages. 


Example 3.5.1 There is a team consisting of Alice, Bob, Chuck, and Dan. In order to com- 
pete in the International Salsa Tournament, a team must have at least one male and one female 
member. All four of the team members are supposed to show up for the competition, but in 
fact none of them does. According to the original and updated definition, each of Alice, Bob, 
Chuck, and Dan is a cause of the team not being able to compete. However, using the alter- 
native approach to graded causality, with (3.1), Alice is a better cause than Bob, Chuck, or 
Dan. Every context where Alice shows up and at least one of Bob, Chuck, or Dan shows up 
gives a witness world for Alice not showing up being a cause; the only witness worlds for 
Bob not showing up being a cause are those where both Alice and Bob show up. Thus, the set 
of witnesses for Alice being a cause is a strict superset of those for Bob being a cause. Under 
minimal assumptions on the normality ordering (e.g., the contexts where just Alice and Bob
show up, the contexts where just Alice and Chuck show up, and the contexts where just Alice and
Dan show up are all mutually incomparable), Alice is a better cause.


According to the modified HP definition, Alice by herself is not a cause for the team 
not being able to compete; neither is Bob. However, both are parts of a cause. The same 
argument then shows that Alice is a better part of a cause than Bob according to the modified 
HP definition. 


With the normality definition in Section 3.2, each of Alice, Bob, Chuck, and Dan would
be considered equally good causes according to the original and updated HP definition, and
equally good parts of causes according to the modified HP definition. People do judge Alice
to be a better cause; the alternative approach seems to capture this intuition better. (See also
the discussion of Example 6.3.2 in Chapter 6.)
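
The comparison of witness sets in Example 3.5.1 can be checked mechanically. The Python sketch below is only an illustration under simplifying assumptions that are not in the text: a context is identified with the set of team members who show up, and the normality ordering is taken to be the coarsest one possible, in which distinct contexts are incomparable. The names `can_compete`, `witness_event`, and `lift` are hypothetical helpers introduced for this sketch.

```python
from itertools import product

MEMBERS = ("Alice", "Bob", "Chuck", "Dan")
MEN = {"Bob", "Chuck", "Dan"}


def can_compete(present):
    """The team can compete iff Alice and at least one man show up."""
    return "Alice" in present and bool(present & MEN)


# Simplifying assumption: a context is just the set of members who show up.
CONTEXTS = [
    frozenset(m for m, shows in zip(MEMBERS, bits) if shows)
    for bits in product((0, 1), repeat=len(MEMBERS))
]


def witness_event(person):
    """Contexts in which `person` shows up and the team can compete."""
    return {c for c in CONTEXTS if person in c and can_compete(c)}


def lift(geq, A, B):
    """A >=^e B: every context in B is matched by one in A at least as normal."""
    return all(any(geq(a, b) for a in A) for b in B)


def geq(u, v):
    """Assumed ordering for this sketch: distinct contexts are incomparable."""
    return u == v


alice = witness_event("Alice")
bob = witness_event("Bob")

print(len(alice), len(bob))   # sizes of the two witness events
print(bob < alice)            # Bob's witnesses are a strict subset of Alice's
print(lift(geq, alice, bob))  # Alice's event is at least as normal as Bob's
print(lift(geq, bob, alice))  # but not conversely under this ordering
```

With a richer normality ordering the last comparison could come out differently; the sketch shows only that the superset relation already makes Alice at least as good a cause, and that under the incomparability assumption she is strictly better.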

Although this alternative approach to AC2$^+$(a) and graded causality deals with the
rock-throwing example and gives a reasonable answer in Example 3.5.1, it does not solve all the
problems involving normality and causality, as the following example shows.


Example 3.5.2 A and B each control a switch. There are wires going from an electricity
source to these switches and then continuing on to C. A must first decide whether to flip his
switch left or right, then B must decide (knowing A’s choice). The current flows, resulting in
a bulb at C turning on, iff both switches are in the same position. B wants to turn on
the bulb, so flips her switch to the same position as A does, and the bulb turns on.

Intuition suggests that A’s action should not be viewed as a cause of the C bulb being on,
whereas B’s should. But suppose that we consider a model with three binary variables, A,
B, and C. A = 0 means that A flips his switch left and A = 1 means that A flips his switch
right; similarly for B. C = 0 if the bulb at C does not turn on, and C = 1 if it does. In this
model, whose causal network is given in Figure 3.3, A’s value is determined by the context,
and we have the equations


• B = A and
• C = 1 iff A = B.


Figure 3.3: Who caused the bulb to turn on?


Without taking normality considerations into account, in the context where A = 1, both
A = 1 and B = 1 are causes of C = 1, according to all variants of the HP definition. B = 1
is in fact a but-for cause of C = 1; to show that A = 1 is a cause of C = 1 for all variants of
the HP definition, we can simply hold B at its actual value 1 and set A = 0. Similarly, for all
other contexts, both A and B are causes of the outcome. However, people tend to view B as 
the only cause in all contexts. 
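
The two counterfactual checks just described can be replayed on the equations B = A and C = 1 iff A = B. The following Python sketch is an illustration only; the function `solve` and the dictionary encoding of interventions are assumptions made for this sketch, not notation from the text.

```python
def solve(a, do=None):
    """Evaluate the switch model in the context where A = a.

    `do` is an optional dict of interventions mapping a variable name to
    the value it is forced to take, overriding that variable's equation.
    """
    do = do or {}
    A = do.get("A", a)            # A's value is set by the context
    B = do.get("B", A)            # equation: B = A
    C = do.get("C", int(A == B))  # equation: C = 1 iff A = B
    return {"A": A, "B": B, "C": C}


actual = solve(1)

# B = 1 is a but-for cause: changing B alone turns the bulb off.
but_for_b = solve(1, {"B": 0})

# Changing A alone does nothing to C, because B follows A.
a_alone = solve(1, {"A": 0})

# Holding B at its actual value while setting A = 0 turns the bulb off.
a_with_b_held = solve(1, {"A": 0, "B": actual["B"]})

print(actual)
print(but_for_b["C"], a_alone["C"], a_with_b_held["C"])
```

The third check is the one that makes A = 1 a cause: it succeeds only because B is held fixed at a value that its own equation would no longer produce.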

Turning on a light bulb is ethically neutral. Does the situation change at all if instead of a 
bulb turning on, the result of the current flowing is that a person gets a shock? 

Before considering normality, it is worth noting that the causal network in Figure 3.3 is 
isomorphic to that given in Figure 2.8 for Billy’s medical condition. Moreover, we can make 
the stories isomorphic by modifying the story in Example 2.4.1 slightly. Suppose we say that 
Billy is fine on Wednesday morning if exactly one doctor treats him and sick otherwise (so
now BMC becomes a binary variable, with value 0 if Billy is fine on Wednesday morning
and 1 if he is sick). If both doctors treat Billy, then I suspect most people would say that 
Tuesday’s doctor is the cause of Billy being sick; but if neither does, I suspect that Monday’s 
doctor would also be viewed as having a significant causal impact. Moreover, in the normal 
situation where Monday’s doctor treats Billy and Tuesday’s does not, I suspect that Monday’s 
doctor would be viewed as more of a cause of Billy feeling fine on Wednesday than Tuesday’s 
doctor. 


If the HP definition is going to account for the differences between the stories, somehow 
normality considerations should do it. They do help to some extent in the second story. In 
particular, normality considerations work just right in the case that both doctors treat Billy (if 
we take the most normal situation to be the one where Monday’s doctor treats him and Tues- 
day’s doesn’t). But if neither doctor treats Billy, the same normality considerations suggest 
that Monday’s doctor is a better cause than Tuesday’s doctor. Although this may not be so 
unreasonable, I suspect that people would still be inclined to ascribe a significant degree of 
causality to Tuesday’s doctor. I return to this point below. 


Now suppose instead that (what we have taken to be) the normal thing happens: just Mon- 
day’s doctor treats Billy. As expected, Billy feels fine on Wednesday. Most people would say 
that Monday’s doctor is the cause of Billy feeling fine, not Tuesday’s doctor. Unfortunately,
AC2$^+$(a) (or AC2$^+$(a$^m$)) is violated no matter who we try to view as the cause. What about
graded causality? The witness for Monday’s doctor being a cause is $s_{MT=0,TT=0,u}$; the
witness for Tuesday’s doctor being a cause is $s_{TT=1,u}$. In the first witness, we get two violations
to normality: Billy’s doctor doesn’t do what he is supposed to do, nor does Tuesday’s doc- 
tor. In the second witness, we get only one violation (Tuesday’s doctor treating Billy when 
he shouldn’t). However, a case can still be made that the world where neither doctor treats 
Billy (MT = 0, TT = 0) should be viewed as more normal than the world where both do 
(MT = 1,TT = 1). The former witness might arise if, for example, both doctors are too 
busy to treat Billy. If Billy’s condition is not that serious, then they might sensibly decide that 
he can wait. However, for the second witness to arise, Tuesday’s doctor must have treated 
Billy despite Billy’s medical chart showing that Monday’s doctor treated him. Tuesday’s doc- 
tor could have also easily talked to Billy to confirm this. Arguably, this makes the latter world 
more of a violation of normality than the former. (The alternative approach works the same 
way for this example.) 


When it comes to determining the cause of the bulb turning on, the situation is more
problematic. Taking $u$ to be the context where A = 1, both $s_{A=0,B=1,u}$ and $s_{B=0,u}$ seem
less normal than $s_u$, and there seems to be no reason to prefer one to the other, so graded
causality does not help in determining a cause. Similarly, for the alternative approach, both
$[\![A = 0 \land B = 1 \land C = 0]\!]$ and $[\![B = 0 \land C = 0]\!]$ are less normal than either $[\![A = 1 \land C = 1]\!]$
or $[\![B = 1 \land C = 1]\!]$. Again, there is no reason to prefer B = 1 as a cause. If instead we
consider the version of the story where, instead of a light bulb turning on, the result is a person
being shocked, we can view both $[\![A = 0 \land B = 1 \land C = 0]\!]$ and $[\![B = 0 \land C = 0]\!]$ as more
normal than either $[\![A = 1 \land C = 1]\!]$ or $[\![B = 1 \land C = 1]\!]$, since shocking someone can be
viewed as abnormal. Now if we take normality into account, both A = 1 and B = 1 become
causes, but there is still no reason to prefer B = 1.

We can deal with both of these issues by making one further change to how we take
normality into account. Rather than considering the “absolute” normality of a witness, we can
consider the change in normality in going from the actual world to the witness and prefer the
witness that leads to the greatest increase (or smallest decrease) in normality. If the normality
ordering is total, then whether we consider the absolute normality or the change in normality
makes no difference. If witness $w_1$ is more normal than witness $w_2$, then the increase in
normality from the actual world to $w_1$ is bound to be greater than the increase in normality from
the actual world to $w_2$. But if the normality ordering is partial, then considering the change in
normality from the actual world can make a difference. 

Going back to Billy’s doctors, it seems reasonable to take the increase in normality
going from $s_{MT=1,TT=1,u}$ to $s_{MT=1,TT=0,u}$ to be much greater than that of going from
$s_{MT=1,TT=1,u}$ to $s_{MT=0,TT=1,u}$. Thus, if both doctors treat Billy, then Tuesday’s doctor
is a much better cause of Billy feeling sick on Wednesday than Monday’s doctor. In contrast,
the increase in normality in going from $s_{MT=0,TT=0,u}$ to $s_{MT=1,TT=0,u}$ is arguably greater
(although not much greater) than that of going from $s_{MT=0,TT=0,u}$ to $s_{MT=0,TT=1,u}$. Thus,
if neither doctor treats Billy, then Monday’s doctor is a better cause of Billy feeling sick on
Wednesday than Tuesday’s doctor, but perhaps not a much better cause. Note that we can
have such a relative ordering if the normality ordering is partial (in particular, if the worlds
$s_{MT=0,TT=0,u}$ and $s_{MT=1,TT=1,u}$ are incomparable), but not if the normality ordering is
total.

Similar considerations apply to the light bulb example. Now we can take the change from 
the world where both A = 1 and B = 1 to the world where A = 1 and B = 0 to be smaller 
than the one to the world where A = 0 and B = 1, because the latter change involves changing 
what A does as well as violating normality (in the sense that B does not act according to the 
equations), while the former change requires only that B violate normality. This gives us a
reason to prefer B = 1 as a cause.


Considering not just how normal a witness is but the change in normality in going from 
the actual world to the witness requires a richer model of normality, where we can consider 
differences. I do not attempt to formalize such a model here, although I believe that it is 
worth exploring. These considerations also arise when dealing with responsibility (see Exam- 
ple 6.3.4 for further discussion). 


Notes 


Much of the material in this chapter is taken from [Halpern and Hitchcock 2015] (in some 
cases, verbatim); more discussion can be found there regarding the use of normality and 
how it relates to other approaches. A formal model that takes normality into account in the 
spirit of the definitions presented here already appears in preliminary form in the original HP 
papers [Halpern and Pearl 2001; Halpern and Pearl 2005a] and in [Halpern 2008]. Others 
have suggested incorporating considerations of normality and defaults in a formal definition 
of actual causality, including Hall [2007], Hitchcock [2007], and Menzies [2004, 2007]. 

The idea that normality considerations are important when considering causality goes back 
to the work of psychologists. Exploring the effect of norms on judgments of causality is also 
an active area of research. I mention here some representatives of this work, although these
references do not do anywhere close to full justice to the ongoing work: 


• Kahneman and Tversky [1982] and Kahneman and Miller [1986] showed that both
  statistical and prescriptive norms can affect counterfactual reasoning.

• Alicke [1992] and Alicke et al. [2011] showed that subjects are more likely to judge
  that someone caused some negative outcome when they have a negative evaluation of
  that person.

• Cushman, Knobe, and Sinnott-Armstrong [2008] showed that subjects are more likely
  to judge that an agent’s action causes some outcome when they hold that the action is
  morally wrong; Knobe and Fraser [2008] showed that subjects are more likely to judge
  that an action causes some outcome if it violates a policy (the Prof. Smith example in
  Section 3.1 is taken from their paper); Hitchcock and Knobe [2009] showed that this
  effect occurs with norms of proper functioning.

• McGill and Tenbrunsel [2000] considered the effect of likelihood and what they call
  mutability and propensity on normality judgments (and hence on causality judgments).
  Roughly speaking, propensity would judge a witness world to be more normal if it had
  a higher prior probability of occurring; if we are trying to determine whether $X = x$ is
  a cause of $\varphi$ and the witness world has $X = x'$, then mutability takes into account how
  “easy” it is to change $X$ from $x$ to $x'$. These issues are best explained by an example.
  Mandel and Lehman [1996] considered a scenario where an executive who chose to
  take a different route home was hit by a drunk teenager. Drunk driving is more likely to
  cause accidents than taking nonstandard routes home. Thus, propensity considerations
  (as well as notions of normality involving conventions) would say that the teenager
  should be a cause. However, if the teenager were perceived to be incapable of changing
  his drinking behavior (perhaps he is under heavy social pressure), then the causality
  ascribed to the choice of route goes up. McGill and Tenbrunsel [2000] showed that the
  cause with higher propensity is typically chosen, provided it is perceived to be mutable.


Approaches to giving semantics to typicality statements include preferential structures 
[Kraus, Lehmann, and Magidor 1990; Shoham 1987], $\epsilon$-semantics [Adams 1975; Geffner
1992; Pearl 1989], possibility measures [Dubois and Prade 1991], and ranking functions 
[Goldszmidt and Pearl 1992; Spohn 1988]. The details of these approaches do not matter 
for our purposes; see [Halpern 2003] for an overview of these approaches. The approach used 
in Section 3.2 is the one used in [Halpern and Hitchcock 2015]; it can be viewed as an instance 
of preferential structures, which allow a partial preorder on worlds. By way of contrast, pos- 
sibility measures and ranking functions put a total preorder on worlds. The model of causality 
that took normality considerations into account that I gave in [Halpern 2008] used ranking 
functions and, as a consequence, made the normality ordering total. Using a total normality 
order causes problems in some examples (see below). 

There has been a great deal of discussion in the philosophy community regarding whether 
omissions count as causes. Beebee [2004] and Moore [2009], for example, argue against 
the existence of causation by omission in general; Lewis [2000, 2004] and Schaffer [2000a, 
2004, 2012] argue that omissions are genuine causes in such cases; Dowe [2000] and Hall 
[2004] argue that omissions have a kind of secondary causal status; and McGrath [2005]
argues that the causal status of omissions depends on their normative status. For example, 
Billy’s doctor’s failure to treat Billy is a cause of Billy’s sickness because he was supposed 
to treat Billy; the fact that other doctors also failed to treat Billy does not make them causes. 
As I observed in the main text, by taking an appropriate normality ordering, we can capture 
all these points of view. Wolff, Barbey, and Hausknecht [2010] and Livengood and Machery 
[2007] provide some experimental evidence showing how people view causality by omission 
and when they allow for causes that are omissions. Their results support the role of normality 
in these causality judgments (although the papers do not mention normality). 

As I said in the main text, there have been claims that it is a mistake to focus on one cause. 
For example, John Stuart Mill [1856, pp. 360-361] writes: 


...it is very common to single out one only of the antecedents under the denom- 
ination of Cause, calling the others mere Conditions. ...The real cause, is the 
whole of these antecedents; and we have, philosophically speaking, no right to 
give the name of cause to one of them, exclusively of the others. 


Lewis [1973a, pp. 558-559] also seems to object to this practice, saying: 


We sometimes single out one among all the causes of some event and call it “the” 
cause, as if there were no others. Or we single out a few as the “causes”, calling 
the rest mere “causal factors” or “causal conditions”. ...I have nothing to say 
about these principles of invidious discrimination. 


Hall [2004, p. 228] adds: 


When delineating the causes of some given event, we typically make what are, 
from the present perspective, invidious distinctions, ignoring perfectly good 
causes because they are not sufficiently salient. We say that the lightning bolt 
caused the forest fire, failing to mention the contribution of the oxygen in the air, 
or the presence of a sufficient quantity of flammable material. But in the egalitar- 
ian sense of “cause”, a complete inventory of the fire’s causes must include the 
presence of oxygen and of dry wood. 


Sytsma, Livengood, and Rose [2012] conducted the follow-up to the Knobe and Fraser 
[2008] experiment discussed in Example 3.4.1 and Section 3.1. They had their subjects rate 
their agreement on a 7-point scale from 1 (completely disagree) to 7 (completely agree) with 
the statements “Professor Smith caused the problem” and “The administrative assistant caused 
the problem”. When they repeated Knobe and Fraser’s original experiment, they got an aver- 
age rating of 4.05 for Professor Smith and 2.51 for the administrative assistant. Although their 
difference is less dramatic than Knobe and Fraser’s, it is still statistically significant. When 
they altered the vignette so that Professor Smith’s action was permissible, subjects gave an 
average rating of 3.0 for Professor Smith and 3.53 for the administrative assistant. 

The bogus prevention problem is due to Hitchcock [2007]; it is based on an example due 
to Hiddleston [2005]. Using a normality ordering that is total (as I did in [Halpern 2008]) 
causes problems in dealing with Example 3.4.1. With a total preorder, we cannot declare 
(A = 1,B = 0,VS = 0) and (A = 0,B = 1, VS = 1) to be incomparable; we must 
compare them. To argue A = 1 is not a cause, we have to assume that (A = 1, B = 1, VS = 1)
is more normal than (A = 0, B = 0, VS = 0). This ordering does not seem so natural.

Example 3.4.2 is due to Hitchcock [2007], where it is called “counterexample to Hitch- 
cock”. Its structure is similar to Hall’s short-circuit example [Hall 2007, Section 5.3]; the 
same analysis applies to both. The observation that normality considerations deal well with 
it is taken from [Halpern 2015a]. The observation that normality considerations do not com- 
pletely solve the railroad switch problem, as shown in Example 3.4.3, is due to Schumacher 
[2014]. The version of the switch problem with the variables LT and RT is essentially how
Hall [2007] and Halpern and Pearl [2005a] modeled the problem. The voting examples in 
Section 3.4.3 are due to Livengood [2013]; the analysis given here in terms of normality is 
taken from [Halpern 2014a]. The analysis of all the other examples in Section 3.4 is taken 
from [Halpern and Hitchcock 2015]. 

For the details of Regina v. Faulkner, see [Regina v. Faulkner 1877]. Moore [2009] uses 
this type of case to argue that our ordinary notion of actual causation is graded, rather than 
all-or-nothing, and that it can attenuate over the course of a causal chain. In the postscript of 
[Lewis 1986b], Lewis uses the phrase “sensitive causation” to describe cases of causation that 
depend on a complex configuration of background circumstances. For example, he describes 
a case where he writes a strong letter of recommendation for candidate A, thus earning him 
a job and displacing second-place candidate B, who then accepts a job at her second choice
of institution, displacing runner-up C, who then accepts a job at another university, where he
meets his spouse, and they have a child, who later dies. Lewis claims that although his writing 
the letter is indeed a cause of the death, it is a highly sensitive cause, requiring an elaborate set 
of detailed conditions to be present. Woodward [2006] says that such causes are “unstable”. 
Had the circumstances been slightly different, writing the letter would not have produced this 
effect (either the effect would not have occurred or it would not have been counterfactually 
dependent on the letter). Woodward argues that considerations of stability often inform our 
causal judgments. The extended HP definition allows us to take these considerations into 
account. 

The Watson v. Kentucky and Indiana Bridge and Railroad case can be found in [Watson v.
Kentucky and Indiana Bridge and Railroad 1910].

Sander Beckers [private communication, 2015] pointed out that putting a normality order- 
ing on worlds does not really solve the problem of making BT = 1 part of a cause of BS = 1 
(see the discussion after Example 3.2.6), thus providing the impetus for the alternative defini- 
tion of normality discussed in Section 3.5. Example 3.5.1 is due to Zultan, Gerstenberg, and 
Lagnado [2012], as are the experiments showing that people view Alice as a better cause than 
Bob in this example. The work of Zultan, Gerstenberg, and Lagnado related to this example is 
discussed further in the notes to Chapter 6. The example of C being shocked in Example 3.5.2
is due to McDermott [1995]. He viewed it as an example of lack of transitivity: A’s flipping
right causes B to flip right, which in turn causes C to get a shock, but most people don’t view
A’s flip as being a cause of C’s shock, according to McDermott. As I observed, normality
considerations can also be used to give us this outcome. Various approaches to extending a
partial ordering $\succeq$ on worlds to an ordering $\succeq^e$ on events are discussed in [Halpern 1997]; the
notation $\succeq^e$ is taken from [Halpern 2003].


Chapter 4


The Art of Causal Modeling 


Fashion models and financial models are similar. They bear a similar relation- 
ship to the everyday world. Like supermodels, financial models are idealized 
representations of the real world, they are not real, they don’t quite work the way 
that the real world works. There is celebrity in both worlds. In the end, there is 
the same inevitable disappointment. 


Satyajit Das, Traders, Guns & Money: Knowns and Unknowns in the Dazzling World of 
Derivatives 


Essentially, all models are wrong, but some are useful. 
G. E. P. Box and N. R. Draper, Empirical Model Building and Response Surfaces 


In the HP definition of causality, causality is relative to a causal model. X = x can be the
cause of $\varphi$ in one causal model and not in another. Many features of a causal model can
impact claims of causality. It is clear that the structural equations can have a major impact 
on the conclusions we draw about causality. For example, it is the equations that allow us to 
conclude that lower air pressure is the cause of the lower barometer reading and not the other 
way around; increasing the barometer reading will not result in higher air pressure. 

But it is not just the structural equations that matter. As shown by the Suzy-Billy rock- 
throwing example (Example 2.3.3), adding extra variables to a model can change a cause to a 
non-cause. Since only endogenous variables can be causes, the split of variables into exoge- 
nous and endogenous can clearly affect what counts as a cause, as we saw in the case of the 
oxygen and the forest fire. As a number of examples in Chapter 3 showed, if we take normal- 
ity considerations into account, the choice of normality ordering can affect causality. Even 
the set of possible values of the variables in the model makes a difference, as Example 2.3.7 
(where both the sergeant and captain give orders) illustrates. 

Some have argued that causality should be an objective feature of the world. Particularly in 
the philosophy literature, the (often implicit) assumption has been that the job of the philoso- 
pher is to analyze the (objective) notion of causation, rather like that of a chemist analyzing 
the structure of a molecule. In the context of the HP approach, this would amount to desig-
nating one causal model as the “right” model. I do not believe that there is one “right” model. 
Indeed, in the spirit of the quote at the beginning of this chapter, I am not sure that there are 
any “right” models, but some models may be more useful, or better representations of reality, 
than others. Moreover, even for a single situation, there may be several useful models. For 
example, suppose that we ask for the cause of a serious traffic accident. A traffic engineer 
might say that the bad road design was the cause; an educator might focus on poor driver 
education; a sociologist might point to the pub near the highway where the driver got drunk; a 
psychologist might say that the cause is the driver’s recent breakup with his girlfriend. Each of 
these answers is reasonable. By appropriately choosing the variables, the structural-equations 
framework can accommodate them all. 

That said, it is useful to have principles by which we can argue that one model is more 
reasonable/useful/appropriate than another. Suppose that a lawyer argues that, although his 
client was drunk and it was pouring rain, the cause of the accident was the car’s faulty brakes, 
which is why his client is suing GM for $5,000,000. If the lawyer were using the HP definition 
of causality, he would then have to present a causal model in which the brakes were the cause. 
His opponent would presumably then present a different model in which the drunkenness or 
the rain was a cause. We would clearly want to say that a model that made the faulty brakes a 
cause because it did not include the rain as an endogenous variable, or it took drunkenness to 
be normal, was not an appropriate model. 

The structural equations can be viewed as describing objective features of the world. At 
least in principle, we can test the effects of interventions to see whether they are correctly 
captured by the equations. In some cases, we might be able to argue that the set of values 
of variables is somewhat objective. In the example above, if C can only report values 0 and
1, then it seems inappropriate to take {0,1,2} to be its set of values. But clearly what is 
exogenous and what is endogenous is to some extent subjective, as is the choice of variables, 
and the normality ordering. Is there anything useful that we can say about them? In this 
chapter, I look at the issue and its impact more carefully. 


4.1 Adding Variables to Structure a Causal Scenario 


A modeler has considerable leeway in choosing which variables to include in a model. As the 
Suzy-Billy rock-throwing example shows, if we want to argue that X = x is the cause of φ
rather than Y = y, then there must be a variable that takes on different values depending on
whether X = x or Y = y is the actual cause. If the model does not contain such a variable,
then there is no way to argue that X = x is the cause. There is a general principle at play here.
Informally, say that a model is insufficiently expressive if it cannot distinguish two different 
scenarios (intuitively, ones for which we would like to make different causal attributions). 
The naive rock-throwing model of Figure 2.2, which just has the endogenous variables ST, 
BT, and BS, is insufficiently expressive in this sense because it cannot distinguish between a 
situation where Suzy’s and Billy’s rocks hit the bottle simultaneously (in which case we would
want to call Billy’s throw a cause) and the one where Suzy hits first (so Billy’s throw is not 
a cause). What about the bogus prevention example from Section 3.4.2? As we saw, here 
too adding an extra variable can give us a more reasonable outcome (without appealing to 
normality considerations). I would argue that in this case as well, the problem is that the
original model was insufficiently expressive. 

The next four examples show the importance of having variables that bring out the causal
structure of a scenario and that distinguish two causal scenarios. One way of understanding the
role of the variables SH and BH in the rock-throwing example is that they help us distinguish 
the scenario where Suzy and Billy hit simultaneously from the one where Suzy hits first. The 
following examples show that this need to distinguish scenarios is quite widespread. 


Example 4.1.1 There are four endogenous binary variables, A, B, C, and S, taking values 
1 (on) and 0 (off). Intuitively, A and B are supposed to be alternative causes of C, and S acts
as a switch. If S = 0, the causal route from A to C is active and that from B to C is dead;
and if S = 1, the causal route from A to C is dead and the one from B to C is active. There
are no causal relations between A, B, and S; their values are determined by the context. The
equation for C is C = (¬S ∧ A) ∨ (S ∧ B).

Suppose that the context is such that A = B = S = 1, so C = 1. B = 1 is a but-for cause
of C = 1, so all the variants of the HP definition agree that it is a cause, as we would hope.
Unfortunately, the original and updated definition also yield A = 1 as a cause of C = 1. The
argument is that in the contingency where S is set to 0, if A = 0, then C = 0, whereas if
A = 1, then C = 1. This does not seem so reasonable. Intuitively, if S = 1, then the value of
A is irrelevant to the outcome. Considerations of normality do not help here; all worlds seem
to be equally normal. Although the modified HP definition does not take A = 1 to be a cause,
it is part of a cause: A = 1 ∧ S = 1 is a cause of C = 1 because if A and S are both set to 0,
C = 0.

This might seem to represent a victory for the modified HP definition, but the situation is 
a little more nuanced. Consider a slightly different story. This time, we view B as the switch, 
rather than S. If B = 1, then C = 1 if either A = 1 or S = 1; if B = 0, then C = 1 only if
A = 1 and S = 0. That is, C = (B ∧ (A ∨ S)) ∨ (¬B ∧ A ∧ ¬S). Although this is perhaps
not as natural a story as the original, such a switch is surely implementable. In any case, a
little playing with propositional logic shows that, in this story, C satisfies exactly the same
equation as before: (¬S ∧ A) ∨ (S ∧ B) is equivalent to (B ∧ (A ∨ S)) ∨ (¬B ∧ A ∧ ¬S).
The key point is that, unlike the first story, in the second story, it seems to me quite reasonable
to say that A = 1 is a cause of C = 1 (as is B = 1). Having A = 1 is necessary for the first
“mechanism” to work. But as long as we model both stories using just the variables A, B, C,
and S, they have the same models with the same equations, so no variant of the HP definition
can distinguish them. 
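
The propositional equivalence behind the two stories can be checked by enumeration. The following Python sketch compares the two descriptions of C on all eight settings of A, B, and S; the function names are illustrative only and are not part of the formal framework.

```python
from itertools import product

def c_first_story(a, b, s):
    # S is the switch: S = 0 routes A to C, S = 1 routes B to C.
    return bool((not s and a) or (s and b))

def c_second_story(a, b, s):
    # B is the switch: B = 1 needs A or S; B = 0 needs A and not S.
    return bool((b and (a or s)) or (not b and a and not s))

# Compare the two equations on every assignment to (A, B, S).
equations_agree = all(
    c_first_story(a, b, s) == c_second_story(a, b, s)
    for a, b, s in product([0, 1], repeat=3)
)
print(equations_agree)
```

Since the two functions agree everywhere, a model that mentions only A, B, C, and S gives both stories the same equation for C, which is why additional variables are needed to tell them apart.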

Given that we have different causal intuitions for the stories, we should model them dif- 
ferently. One way to distinguish them is to add two more endogenous random variables, say 
D and E, that describe the ways that C could be 1. In the original story, we would have the
equations D = ¬S ∧ A, E = S ∧ B, and C = D ∨ E. In this model, since D = 0 in the
actual context, it is not hard to see that all variants of the HP definition agree that A = 1 is
not a cause of C = 1 (nor is it part of a cause), whereas B = 1 and S = 1 are causes, as they
should be. Thus, in this model, we correctly capture our intuitions for the original story. 

To capture the second story, we can add variables D′ and E′ such that D′ = B ∧ (A ∨ S),
E′ = ¬B ∧ A ∧ ¬S, and C = D′ ∨ E′. In this model, it is not hard to see that according to
the original and updated definition, all of A = 1, B = 1, and S = 1 are causes of C = 1;
according to the modified HP definition, the causes stay the same: B = 1 and A = 1 ∧ S = 1.
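
To make the contrast between the two extended models concrete, here is a minimal brute-force sketch of the modified HP definition for small binary models. It assumes the conditions AC1, AC2(a^m), and AC3 as stated in Chapter 2; the helper names (solve, ac2_am, is_cause, model) are hypothetical, and in the second story the keys "D" and "E" stand for D′ and E′. It is an illustration for this example, not an efficient or general implementation.

```python
from itertools import combinations, product

def solve(equations, do=None):
    """Evaluate equations in dependency order; `do` overrides intervened variables."""
    do = do or {}
    vals = {}
    for var, f in equations.items():
        vals[var] = do[var] if var in do else int(f(vals))
    return vals

def ac2_am(equations, cause, phi):
    """AC2(a^m): some new setting of `cause`, with some set W of other
    variables frozen at their actual values, makes phi false."""
    actual = solve(equations)
    rest = [v for v in equations if v not in cause]
    for r in range(len(rest) + 1):
        for w in combinations(rest, r):
            frozen = {v: actual[v] for v in w}
            for new in product([0, 1], repeat=len(cause)):
                do = {**frozen, **dict(zip(cause, new))}
                if not phi(solve(equations, do)):
                    return True
    return False

def is_cause(equations, cause, phi):
    """Modified HP check: AC1, AC2(a^m), and minimality (AC3)."""
    if not phi(solve(equations)):  # AC1: phi holds in the actual context
        return False
    if not ac2_am(equations, cause, phi):
        return False
    # AC3: no nonempty proper subset of the conjuncts already satisfies AC2(a^m).
    return not any(
        ac2_am(equations, list(sub), phi)
        for r in range(1, len(cause))
        for sub in combinations(cause, r)
    )

def model(story):
    # The context A = B = S = 1 is written directly into the first three equations.
    eqs = {"A": lambda v: 1, "B": lambda v: 1, "S": lambda v: 1}
    if story == "first":
        eqs["D"] = lambda v: (not v["S"]) and v["A"]
        eqs["E"] = lambda v: v["S"] and v["B"]
    else:
        eqs["D"] = lambda v: v["B"] and (v["A"] or v["S"])
        eqs["E"] = lambda v: (not v["B"]) and v["A"] and (not v["S"])
    eqs["C"] = lambda v: v["D"] or v["E"]
    return eqs

def c_is_1(v):
    return v["C"] == 1

candidates = (["A"], ["B"], ["S"], ["A", "S"])
results = {
    story: {tuple(c): is_cause(model(story), c, c_is_1) for c in candidates}
    for story in ("first", "second")
}
for story, verdicts in results.items():
    print(story, verdicts)
```

Under these assumptions, the first story yields B = 1 and S = 1 as causes and rejects A = 1 even as part of a cause, whereas the second story yields B = 1 and the conjunction A = 1 ∧ S = 1.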

Now, as I said above, we can certainly create these setups in the real world. Consider 
the original story. We can design a circuit where there is a source of power at A and B, a 
physical switch at S, and a bulb at C that turns on (C = 1) if either the circuit is connected
to A (A = 1) and the switch is turned left (S = 0) or the circuit is connected to B (B = 1) 
and the switch is turned right (S = 1). We can similarly design a circuit that captures the 
second story. With this circuit, we can think of D as representing “there is current flowing 
between power source A and bulb C”; there are similar physical analogues of E, D′, and E′.
But even without physical analogues, these “structuring” variables can play an important role. 
Suppose that the agent does not see the underlying physical setup. All he sees are switches 
corresponding to A, B, and S, which can be manipulated. In the first setup, flipping the switch 
for A connects or disconnects the circuit and the A battery, and similarly for B, while flipping 
the switch for S connects C either to the A or B battery. In the second story, the switches 
for A, B, and S work differently, but as far as the observer is concerned, all interventions 
lead to the same outcome on the bulb C in both cases. A modeler could usefully introduce D
and E (resp., D′ and E′) to disambiguate these models. These variables can be viewed as the
modeler’s proxies for the underlying physical circuit; they help describe the mechanisms that
result in C being 1.

Note that we do not want to think of D as being defined to take the value 1 if A = 1 and 
S = 0. For then we could not intervene to set D = 0 if A = 1 and S = 0. Adding a variable
to the model commits us to being able to intervene on it. (I return to this point in Section 4.6.) 
In the real world, setting D to 0 despite having A = 1 and S = 0 might correspond to the 
connection being faulty when the switch is turned left. 


Example 4.1.2 A lamp L is controlled by three switches, A, B, and C, each of which has 
three possible positions, -1, 0, and 1. The lamp switches on iff two or more of the switches
are in the same position. Thus, L = 1 iff (A = B) ∨ (B = C) ∨ (A = C). Suppose that, in
the actual context, A = 1, B = -1, and C = -1. Intuition suggests that although B = -1
and C = -1 should be causes of L = 1, A = 1 should not: since the setting of A does not
match that of either B or C, it has no causal impact on the outcome.

Since B = -1 and C = -1 are but-for causes of L = 1, all three variants of the HP
definition declare them to be causes. Unfortunately, the original and updated definition also
declare A = 1 to be a cause. For in the contingency where B = 1 and C = -1, if A = 1,
then L = 1, whereas if A = 0, then L = 0. Adding defaults to the picture does not solve the 
problem. Again, the modified HP definition does the “right” thing here and does not declare 
A = 1 to be a cause or even part of a cause. 

However, just as in Example 4.1.1, another story can be told here, where the observed 
variables have the same values and are connected by the same structural equations. Now 
suppose that L = 1 iff either (a) none of A, B, or C is in position -1; (b) none of A, B, or C
is in position 0; or (c) none of A, B, or C is in position 1. It is easy to see that the equations 
for L are literally the same as in the original example. But now it seems more reasonable to 
say that A = 1 is a cause of L = 1. Certainly A = 1 causes L = 1 as a result of no values 
being 0; had A been 0, then the lamp would still have been on, but now it would be as a result 
of no values being -1. Considering the contingency where B = 1 and C = -1 “uncovers”
the causal impact of A. 
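
That the two rules for the lamp determine the same equation for L can be confirmed by enumerating all 27 switch settings. The sketch below does this in Python; the function names are illustrative only.

```python
from itertools import product

POSITIONS = (-1, 0, 1)

def lamp_two_agree(a, b, c):
    # First rule: two or more switches are in the same position.
    return int(a == b or b == c or a == c)

def lamp_position_unused(a, b, c):
    # Second rule: there is a position that none of the switches is in.
    return int(any(p not in (a, b, c) for p in POSITIONS))

rules_agree = all(
    lamp_two_agree(*setting) == lamp_position_unused(*setting)
    for setting in product(POSITIONS, repeat=3)
)
print(rules_agree)

# The contingency from the text: with B = 1 and C = -1, the value of A matters.
print(lamp_two_agree(1, 1, -1), lamp_two_agree(0, 1, -1))
```

The agreement is a pigeonhole fact: with three switches and three positions, two switches share a position exactly when some position is left unused.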

Again, the distinction between the two stories can be captured by adding more variables. 
For the second story, add the variables NOT(-1), NOT(0), and NOT(1), where NOT(i)
is 1 iff none of A, B, or C is i. Then L = NOT(-1) ∨ NOT(0) ∨ NOT(1). Now all three
variants of the HP definition agree that A = 1 is a cause of L = 1 (as well as B = -1 and
C = -1). For the original story, add the variables TWO(-1), TWO(0), and TWO(1),
where TWO(i) = 1 iff at least two of A, B, and C are i, and take L = TWO(-1) ∨
TWO(0) ∨ TWO(1). In this case, all three variants agree that A = 1 is not a cause of L = 1
(whereas B = -1 and C = -1 continue to be causes).
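
The effect of the structuring variables can be illustrated for the modified HP definition with a small Python sketch. It computes L from either family of structuring variables and allows some of them to be frozen, which is how a witness for AC2(a^m) is built. The function name lamp and its parameters are hypothetical, and only interventions on A together with frozen structuring variables are explored (freezing B or C at their actual values changes nothing here).

```python
from itertools import combinations

POSITIONS = (-1, 0, 1)

def lamp(a, b, c, structure, frozen=None):
    """Value of L when L is the disjunction of three structuring variables.

    structure="TWO": TWO(i) = 1 iff at least two switches are in position i.
    structure="NOT": NOT(i) = 1 iff no switch is in position i.
    `frozen` maps a position i to a value imposed on TWO(i) or NOT(i).
    """
    switches = (a, b, c)
    if structure == "TWO":
        mech = {i: int(switches.count(i) >= 2) for i in POSITIONS}
    else:
        mech = {i: int(i not in switches) for i in POSITIONS}
    mech.update(frozen or {})
    return int(any(mech.values()))

# NOT story: in the actual context (A, B, C) = (1, -1, -1), NOT(1) = 0.
# Setting A to 0 while holding NOT(1) at 0 turns the lamp off.
not_story_witness = lamp(0, -1, -1, "NOT", frozen={1: 0})

# TWO story: the actual values are TWO(-1) = 1, TWO(0) = 0, TWO(1) = 0.
# Try every value of A and every subset of TWO variables frozen at those values.
actual_two = {-1: 1, 0: 0, 1: 0}
two_story_values = {
    lamp(a, -1, -1, "TWO", frozen={i: actual_two[i] for i in w})
    for a in POSITIONS
    for r in range(len(POSITIONS) + 1)
    for w in combinations(POSITIONS, r)
}
print(not_story_witness, two_story_values)
```

In the NOT structuring a witness exists, so A = 1 counts as a cause; in the TWO structuring the lamp stays on under every such intervention, since TWO(-1) remains 1 whatever A is.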

Once again, I think of the variables NOT(-1), NOT(0), and NOT(1) (resp., TWO(-1),
TWO(0), and TWO(1)) as “structuring” variables that help the modeler distinguish the two 
scenarios and, specifically, the mechanisms that lead to the lamp turning on. 

This example (and the others in this section) shows that causality is to some extent dependent
on how we describe a situation. If we describe the rule for the lamp being on as “it is on
if two or more of the switches are in the same position”, we get a different answer than if we
say “it is on if there is a position such that none of the switches are in that position”, despite
the fact that these two rules lead to the same outcomes. This seems to me compatible with
human ascriptions of causality. But even if we accept that, where does this leave a modeler who
is thinking in terms of, say, the first rule, and to whom the second rule does not even occur?
My sense is that the description of the first rule naturally suggests a “structuring” in terms
of the variables TWO(-1), TWO(0), and TWO(1), so it is good modeling practice to add
these variables even if no other way of structuring the story occurs to the modeler. ■


Example 4.1.3 Consider a model M with four endogenous variables, A, B, D, and E. The
values of A and D are determined by the context. The values of B and E are given by the
equations B = A and E = D. Suppose that the context u is such that A = D = 1. Then
clearly, in (M, u), A = 1 is a cause of B = 1 and not a cause of E = 1, whereas D = 1 is a
cause of E = 1 and not of B = 1. The problem comes if we replace A in the model by X,
where, intuitively, X = 1 iff A and D agree (i.e., X = 1 if the context would have been such
that A = D = 1 or A = D = 0). The value of A can be recovered from the values of D and
X. Indeed, it is easy to see that A = 1 iff X = D = 1 or X = D = 0. Thus, we can rewrite
the equation for B by taking B = 1 iff X = D = 1 or X = D = 0. Formally, we have a new
model M′ with endogenous variables X, B, D, and E. The values of D and X are given by the
context; as suggested above, the equations for B and E are


\[
B = \begin{cases} 1 & \text{if } X = D \\ 0 & \text{if } X \neq D; \end{cases}
\qquad E = D.
\]

The context u is such that D = X = 1, so B = E = 1. In (M′, u), it is still the case that
D = 1 is a cause of E = 1, but now D = 1 is also a cause of B = 1.

Is it so unreasonable that D = 1 is a cause of B = 1 in M′ but is not a cause of B = 1 in
M? Of course, if we have in mind the story represented by model M, then it is unreasonable
for D = 1 to be a cause. But M′ can also represent quite a different situation. Consider the
following two stories. We are trying to determine the preferences of two people, Betty and
Edward, in an election. B = 1 if Betty is recorded as preferring the Democrats and B = 0 
if Betty is recorded as preferring the Republicans, and similarly for E. In the first story, we
send Alice to talk to Betty and David to talk to Edward to find out their preferences (both are 
assumed to be truthful and good at finding things out). When Alice reports that Betty prefers 
the Democrats (A = 1), then Betty is reported as preferring the Democrats (B = 1); similarly 
for David and Edward. Clearly, in this story (which is modeled by M), D = 1 causes E = 1
but not B = 1. 

But now suppose instead of sending Alice to talk to Betty, Xavier is sent to talk to Carol, 
who knows only whether Betty and Edward have the same preferences. Carol tells Xavier 
that they indeed have the same preferences (X = 1). Upon hearing that X = D = 1, the 
vote tabulator correctly concludes that B = 1. This story is modeled by M′. But in this case,
it strikes me as perfectly reasonable that D = 1 should be a cause of B = 1. This is true
despite the fact that if we had included the variable A in M′, it would have been the case that
A = B = 1.

Again, we have two different stories; we need to have different models to disambiguate
them. ■
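
The difference between the two models in this example shows up already at the level of a single intervention. The Python sketch below encodes both sets of equations and intervenes on D; the function names and the `do` parameter are illustrative assumptions, not notation from the text.

```python
def solve_m(a, d, do=None):
    """Model M: B = A and E = D, with A and D given by the context."""
    do = do or {}
    a, d = do.get("A", a), do.get("D", d)
    return {"A": a, "D": d, "B": a, "E": d}

def solve_m_prime(x, d, do=None):
    """Model M': B = 1 iff X = D, and E = D, with X and D given by the context."""
    do = do or {}
    x, d = do.get("X", x), do.get("D", d)
    return {"X": x, "D": d, "B": int(x == d), "E": d}

# Actual contexts: A = D = 1 in M, and X = D = 1 in M'. In both, B = E = 1.
b_in_m = solve_m(1, 1, do={"D": 0})["B"]              # B after setting D to 0 in M
b_in_m_prime = solve_m_prime(1, 1, do={"D": 0})["B"]  # B after setting D to 0 in M'
print(b_in_m, b_in_m_prime)
```

In M the intervention on D leaves B untouched, whereas in M′ it flips B, so D = 1 is a but-for cause of B = 1 in M′ only.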


At the risk of overkill, here is yet one more example: 


Example 4.1.4 A ranch has five individuals: a_1, ..., a_5. They have to vote on two possible
outcomes: staying at the campfire (O = 0) or going on a round-up (O = 1). Let A_i be
the random variable denoting a_i’s vote, so A_i = j if a_i votes for outcome j. There is a
complicated rule for deciding on the outcome. If a_1 and a_2 agree (i.e., if A_1 = A_2), then that
is the outcome. If a_2, ..., a_5 agree and a_1 votes differently, then the outcome is given by a_1’s
vote (i.e., O = A_1). Otherwise, majority rules. In the actual situation, A_1 = A_2 = 1 and
A_3 = A_4 = A_5 = 0, so by the first mechanism, O = 1. The question is what were the causes
of O = 1.

Using the naive causal model with just the variables A_1, ..., A_5, O, and the obvious equations
describing O in terms of A_1, ..., A_5, it is almost immediate that A_1 = 1 is a but-for
cause of O = 1. Changing A_1 to 0 results in O = 0. Somewhat surprisingly, in this naive
model, A_2 = 1, A_3 = 0, A_4 = 0, and A_5 = 0 are also causes according to the original and
updated HP definition. To see that A_2 = 1 is a cause, consider the contingency where A_3 = 1.
Now if A_2 = 0, then O = 0 (majority rules); if A_2 = 1, then O = 1, since A_1 = A_2 = 1, and
O = 1 even if A_3 is set back to its original value of 0. The argument that A_3 = 0, A_4 = 0,
and A_5 = 0 are causes according to the original and updated definition is the same; I will give
the argument for A_3 = 0 here. Consider the contingency where A_2 = 0, so that all voters but
a_1 vote for 0 (staying at the campsite). If A_3 = 1, then O = 0 (majority rules). If A_3 = 0,
then O = 1, by the second mechanism (a_1 is the only vote for 1), whereas if A_2 is set to
its original value of 1, then we still have O = 1, now by the first mechanism. Although the
modified HP definition does not declare any of A_2 = 1, A_3 = 0, A_4 = 0, or A_5 = 0 to be
causes, it does declare each of the conjunctions A_2 = 1 ∧ A_3 = 0, A_2 = 1 ∧ A_4 = 0, and
A_2 = 1 ∧ A_5 = 0 to be a cause; for example, if the values of A_2 and A_3 are changed, then
O = 0. So, according to the modified HP definition, each of A_2 = 1, A_3 = 0, A_4 = 0, and
A_5 = 0 is part of a cause.

An arguably better model of the story is suggested by the analysis, which talks about the
first and second mechanisms. This, in turn, suggests that the choice of mechanism should be
part of the model. There are several ways of doing this. One is to add three new variables,
call them M_1, M_2, and M_3. These variables have values in {0,1,2}, where M_j = 0 if
mechanism j is active and suggests an outcome of 0, M_j = 1 if mechanism j is active and
suggests an outcome of 1, and M_j = 2 if mechanism j is not active. (We actually don’t need
the value M_3 = 2; mechanism 3 is always active, because there is always a majority with 5
voters, all of whom must vote.) There are obvious equations linking the values of M_1, M_2,
and M_3 to the values of A_1, ..., A_5. (Actually, since the mechanisms are now represented by
variables that can be intervened on, we would have to define what happens as a result of an
“unnatural” intervention that sets both M_1 = 1 and M_2 = 1; for definiteness, say in that case
that mechanism M_1 is applied.)

Now the value of O just depends on the values of M_1, M_2, and M_3: if M_1 ≠ 2, then
O = M_1; if M_1 = 2 and M_2 ≠ 2, then O = M_2; and if M_1 = M_2 = 2, then O = M_3. It is
easy to see that in this model, in the context where A_1 = A_2 = 1 and A_3 = A_4 = A_5 = 0,
then, according to all the variants of the HP definition, none of A_3 = 0, A_4 = 0, or A_5 = 0
is a cause, whereas A_1 = 1 is a cause, as we would expect, as are A_2 = 1 and M_1 = 1.
This seems reasonable: the first mechanism was the one that resulted in the outcome, and
it required A_1 = A_2 = 1.
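
The mechanism-based model of the voting rule can be sketched in Python as follows. The encoding is one reading of the “obvious equations” (the names mechanisms, outcome, and the keys "M1", "M2", "M3" are assumptions made for illustration); the `frozen` argument plays the role of holding mechanism variables at chosen values when testing the modified HP definition.

```python
def mechanisms(votes):
    """Values of M_1, M_2, M_3 for a vote profile (A_1, ..., A_5); 2 means inactive."""
    a1, a2, a3, a4, a5 = votes
    m1 = a1 if a1 == a2 else 2                           # a_1 and a_2 agree
    m2 = a1 if (a2 == a3 == a4 == a5 != a1) else 2       # a_2..a_5 agree, a_1 differs
    m3 = int(sum(votes) >= 3)                            # majority of five voters
    return {"M1": m1, "M2": m2, "M3": m3}

def outcome(votes, frozen=None):
    """O as a function of the mechanism variables, some of which may be frozen."""
    m = mechanisms(votes)
    m.update(frozen or {})
    if m["M1"] != 2:
        return m["M1"]
    if m["M2"] != 2:
        return m["M2"]
    return m["M3"]

actual = (1, 1, 0, 0, 0)
print(mechanisms(actual), outcome(actual))

# Changing A_2 alone leaves O = 1 (the second mechanism takes over), but with
# M_2 frozen at its actual value 2 the outcome falls back to the majority.
print(outcome((1, 0, 0, 0, 0)), outcome((1, 0, 0, 0, 0), frozen={"M2": 2}))

# Changing A_3 never matters: the first mechanism still applies.
print(outcome((1, 1, 1, 0, 0), frozen={"M2": 2, "M3": 0}))
```

Checking the conditions on the chained comparison: `a2 == a3 == a4 == a5 != a1` holds exactly when a_2, ..., a_5 agree and a_1 votes differently. Note also that the order of the tests in outcome implements the convention that mechanism 1 wins if both M_1 and M_2 are set by intervention.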

Now suppose that we change the description of the voting rule. We take O = 1 if one of 
the following two mechanisms applies: 


• A_1 = 1 and it is not the case that both A_2 = 0 and exactly one of A_3, A_4, and A_5 is 1.
• A_1 = 0, A_2 = 1, and exactly two of A_3, A_4, and A_5 are 1.


It is not hard to check that, although the description is different, O satisfies the same equation 
in both stories. But now it does not seem so unreasonable that A_2 = 1, A_3 = 0, A_4 = 0,
and A_5 = 0 are causes of O = 1. And indeed, if we construct a model in terms of these two
mechanisms (i.e., add variables M′_1 and M′_2 that correspond to these two mechanisms), then it
is not hard to see that A_1 = 1, A_2 = 1, A_3 = 0, A_4 = 0, and A_5 = 0 are all causes according
to the original and updated HP definition. With the modified definition, things are unchanged:
A_1 = 1, A_2 = 1 ∧ A_3 = 0, A_2 = 1 ∧ A_4 = 0, and A_2 = 1 ∧ A_5 = 0 continue to be causes
(again, {A_2, A_3}, {A_2, A_4}, and {A_2, A_5} are minimal sets of variables whose values have
to change to change the outcome if the first mechanism is used).
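
The claim that O satisfies the same equation under both descriptions of the voting rule can be verified over all 32 vote profiles. The Python sketch below does so; the function names are illustrative only.

```python
from itertools import product

def rule_original(a1, a2, a3, a4, a5):
    # First description: agreement of a_1 and a_2, then the lone-dissenter
    # rule for a_1, and otherwise majority.
    if a1 == a2:
        return a1
    if a2 == a3 == a4 == a5:      # a_1 differs automatically, since a1 != a2 here
        return a1
    return int(a1 + a2 + a3 + a4 + a5 >= 3)

def rule_redescribed(a1, a2, a3, a4, a5):
    # Second description: O = 1 iff one of the two listed mechanisms applies.
    k = a3 + a4 + a5              # number of 1-votes among a_3, a_4, a_5
    first = a1 == 1 and not (a2 == 0 and k == 1)
    second = a1 == 0 and a2 == 1 and k == 2
    return int(first or second)

descriptions_agree = all(
    rule_original(*votes) == rule_redescribed(*votes)
    for votes in product([0, 1], repeat=5)
)
print(descriptions_agree)
```

As in the earlier examples, the agreement of the equations is exactly what prevents a model over A_1, ..., A_5 and O alone from distinguishing the two stories.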

Here the role of the structuring variables M_1, M_2, and M_3 (resp. M′_1 and M′_2) as descriptors
of the mechanism being invoked seems particularly clear. For example, setting M_1 = 2
says that the first mechanism will not be applied, even if A_1 = A_2; setting M_1 = 1 says that
we act as if both a_1 and a_2 voted in favor, even if that is not the case. ■


The examples considered so far in this section show how, by adding variables that describe 
the mechanism of causality, we can distinguish two situations that otherwise seem identical. 
As the following example shows, adding variables that describe the mechanism also allows 
us to convert a part of a cause (according to the modified HP definition) to a cause. 


Example 4.1.5 Suppose that we add variables A, B, and C to the disjunctive forest-fire
example (Example 2.3.1), where A = L ∧ ¬MD, B = ¬L ∧ MD, and C = L ∧ MD. We
then replace the earlier equation for FF (i.e., FF = L ∨ MD) by FF = A ∨ B ∨ C. The
variables A, B, and C can be viewed as describing the mechanism by which the forest fire
happened. Did it happen because of the dropped match only, because of the lightning only,
or because of both? In this model, not only are L = 1 and MD = 1 causes of FF = 1
according to the original and updated HP definition, they are also causes according to the
modified definition. For if we fix A and B at their actual values of 0, then FF = 0 if L is set
to 0, so AC2(a^m) is satisfied and L = 1 is a cause; an analogous argument applies to MD.
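
The witness used in this argument can be reproduced with a few lines of Python. The sketch below encodes the three mechanism variables and lets some of them be frozen; the function name forest_fire and the `frozen` parameter are illustrative assumptions.

```python
def forest_fire(l, md, frozen=None):
    """FF = A or B or C, where A, B, C describe how the fire came about."""
    mech = {
        "A": int(l and not md),   # lightning only
        "B": int(not l and md),   # dropped match only
        "C": int(l and md),       # both
    }
    mech.update(frozen or {})     # hold some mechanisms at given values
    return int(mech["A"] or mech["B"] or mech["C"])

# Actual context: L = MD = 1, so A = B = 0, C = 1, and FF = 1.
actual_ff = forest_fire(1, 1)
# Setting L to 0 alone does not stop the fire, because B becomes 1.
ff_without_freezing = forest_fire(0, 1)
# With A and B frozen at their actual value 0, setting L to 0 gives FF = 0.
ff_with_freezing = forest_fire(0, 1, frozen={"A": 0, "B": 0})
print(actual_ff, ff_without_freezing, ff_with_freezing)
```

The symmetric call forest_fire(1, 0, frozen={"A": 0, "B": 0}) gives the analogous witness for MD = 1.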

I would argue that this is a feature of the modified definition, not a bug. Suppose, for ex- 
ample, that we interpret A, B, and C as describing the mechanism by which the fire occurred. 
If these variables are in the model, then this suggests that we care about the mechanism. The 
fact that L = 1 is part of the reason that FF = 1 occurred thanks to mechanism C. Although
the forest fire would still have occurred if the lightning hadn’t struck, it would have been due 
to a different mechanism. (As an aside, adding mechanisms in this way is actually a general 
approach for converting conjunctive causes in the modified HP definition to single conjuncts.
It is easy to see that a similar approach would work for Example 2.3.2, where Suzy
wins the election 11-0, although now we would have to add (11 choose 6) new variables, one for each
possible subset of 6 voters.)

In the original model, we essentially do not care about the details of how the fire comes 
about. Now suppose that we care only about whether lightning was a cause. In that case, we 
would add only the variable B, with B = ¬L ∧ MD, as above, and set FF = L ∨ B. In this
case, in the context where L = MD = 1, all three variants of the HP definition agree that
only L = 1 is a cause of FF = 1; MD = 1 is not (and is not even part of a cause). Again,
I would argue that this is a feature. The structure of the model tells us that we should care
about how the fire came about, but only to the extent of whether it was due to L = 1. In the
actual context, MD = 1 has no impact on whether L = 1.

But now consider the variant of the disjunctive model considered in Example 2.9.1. Recall 
that in this variant, there are two arsonists; in the abnormal setting where only one of them 
drops a match, FF = 2: there is a less significant fire. In this model, MD = 1 is a cause of
the forest fire according to the modified HP definition, as we would expect, but L = 1 is not even
part of a cause, since L = 1 ∧ MD = 1 is no longer a cause (MD = 1 is a cause all by itself).
This seems at least mildly disconcerting. But now normality considerations save the day. By
assumption, the witness MD = 1 ∧ MD′ = 0 ∧ FF = 2 needed to declare MD = 1 a cause is
abnormal, so AC2^+(a^m) does not hold. Using graded causality, L = 1 ∧ MD = 1 is a better
cause than MD = 1. ■


4.2 Conservative Extensions 


In the rock-throwing example, by adding the variables SH and BH, and moving from the 
naive model M_RT to M′_RT, BT = 1 changed from being a cause to not being a cause of
BS = 1. Similarly, adding extra variables affected causality in all the examples in Section 4.1.
Of course, without any constraints, it is easy to add variables to get any desired result. For
example, consider the “sophisticated” rock-throwing model M′_RT. Suppose that a variable
BH_1 is added with equations that set BH_1 = BT and BS = SH ∨ BT ∨ BH_1, giving a
new causal model M*_RT. In M*_RT, there is a new causal path from BT to BS going through
BH_1, independent of all other paths. Not surprisingly, in M*_RT, BT = 1 is indeed a cause of
BS = 1.

But this seems like cheating. Adding this new causal path fundamentally changes the 
scenario; Billy’s throw has a new way of affecting whether the bottle shatters. Although it 
seems reasonable to refine a model by adding new information, we want to do so in a way 
that does not affect what we know about the old variables. Intuitively, suppose that we had a 
better magnifying glass and could look more carefully at the model. We might discover new 
variables that were previously hidden. But we want it to be the case that any setting of the old 
variables results in the same observations. That is, while adding the new variable refines the 
model, it does not fundamentally change it. This is made precise in the following definition. 


Definition 4.2.1 A causal model M′ = ((𝒰′, 𝒱′, ℛ′), ℱ′) is a conservative extension of
M = ((𝒰, 𝒱, ℛ), ℱ) if 𝒰 = 𝒰′, 𝒱 ⊂ 𝒱′, and for all contexts u⃗, all variables X ∈ 𝒱, and
all settings w⃗ of the variables in W⃗ = 𝒱 - {X}, we have (M, u⃗) ⊨ [W⃗ ← w⃗](X = x) iff
(M′, u⃗) ⊨ [W⃗ ← w⃗](X = x). That is, no matter how we set the variables in M other than
X, X has the same value in context u⃗ in both M and M′. ■


According to Definition 4.2.1, M′ is a conservative extension of M iff, for certain formulas
ψ involving only variables in 𝒱, namely, those of the form [W⃗ ← w⃗](X = x), (M, u⃗) ⊨ ψ iff
(M′, u⃗) ⊨ ψ. As the following lemma shows, this is actually true for all formulas involving
only variables in 𝒱, not just ones of a special form. (It is perhaps worth noting that here is
a case where we cannot get away with omitting the causal model when using statisticians’ 
notation. The notion of a conservative extension requires comparing the truth of the same 
formula in two different causal models.) 


Lemma 4.2.2 Suppose that M′ is a conservative extension of M = ((𝒰, 𝒱, ℛ), ℱ). Then for
all causal formulas φ that mention only variables in 𝒱 and all contexts u⃗, we have (M, u⃗) ⊨ φ
iff (M′, u⃗) ⊨ φ.


It is not hard to show that the “sophisticated” model for the rock-throwing example M′_RT
is a conservative extension of M_RT.


Proposition 4.2.3 M′_RT is a conservative extension of M_RT.


Proof: It suffices to show that, whatever the setting of U, BT, and ST, the value of BS is
the same in both M_RT and M′_RT. It is easy to check this by considering all the cases: in both
models, BS = 1 if either ST = 1 or BT = 1, and BS = 0 if ST = BT = 0 (independent
of the value of U). ■
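
The case analysis in this proof, and more generally the condition in Definition 4.2.1, can be checked by brute force for small binary models. The Python sketch below does so under some stated assumptions: a context is represented as a pair (st, bt) fixing whether Suzy and Billy throw; the equations SH = ST, BH = BT ∧ ¬SH, and BS = SH ∨ BH for the sophisticated model are taken from the rock-throwing example in Chapter 2; and the names solve, is_conservative_extension, and M_RT_EXTRA_PATH are hypothetical. The last model is the variant with the extra path from BT to BS described at the start of this section.

```python
from itertools import product

def solve(equations, context, do):
    """Evaluate equations in order; variables in `do` are held at the given value."""
    vals = {}
    for var, f in equations.items():
        vals[var] = do[var] if var in do else int(f(vals, context))
    return vals

M_RT = {
    "ST": lambda v, u: u[0],
    "BT": lambda v, u: u[1],
    "BS": lambda v, u: v["ST"] or v["BT"],
}
M_RT_PRIME = {
    "ST": lambda v, u: u[0],
    "BT": lambda v, u: u[1],
    "SH": lambda v, u: v["ST"],
    "BH": lambda v, u: v["BT"] and not v["SH"],
    "BS": lambda v, u: v["SH"] or v["BH"],
}
# Variant with the extra variable BH1 = BT and BS = SH or BT or BH1.
# BS is re-inserted last so that BH1 is evaluated before it.
M_RT_EXTRA_PATH = {k: f for k, f in M_RT_PRIME.items() if k != "BS"}
M_RT_EXTRA_PATH["BH1"] = lambda v, u: v["BT"]
M_RT_EXTRA_PATH["BS"] = lambda v, u: v["SH"] or v["BT"] or v["BH1"]

def is_conservative_extension(small, big):
    """Check Definition 4.2.1 for binary models sharing contexts (st, bt)."""
    old_vars = list(small)
    for context in product([0, 1], repeat=2):
        for x in old_vars:
            others = [v for v in old_vars if v != x]
            for setting in product([0, 1], repeat=len(others)):
                do = dict(zip(others, setting))
                if solve(small, context, do)[x] != solve(big, context, do)[x]:
                    return False
    return True

print(is_conservative_extension(M_RT, M_RT_PRIME))
print(is_conservative_extension(M_RT_PRIME, M_RT_EXTRA_PATH))
```

The second check fails because, when SH and BH are both set to 0 and BT to 1, the bottle does not shatter in the sophisticated model but does in the variant with the extra path. This is the sense in which adding that path changes what we know about the old variables.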


The model M″_RT, which has additional endogenous variables SA, BA, SF, and BF, is not
a conservative extension of either M′_RT or M_RT. The fact that Suzy and Billy are accurate
shots and that, if both throw, Suzy hits first, are no longer baked in with M″_RT. This means
that M″_RT differs from M_RT and M′_RT in some fundamental ways. We can’t really think of
the contexts as being the same in M_RT and M″_RT; the context in M″_RT determines the values
of BA, SA, BF, and SF, as well as that of ST and BT.

As this discussion shows, the requirement in Definition 4.2.1 that 𝒰 = 𝒰′ is quite a strong
one. Roughly speaking, it says that whatever extra endogenous variables are added in going
from M to M′, their values are determined by the (exogenous and endogenous) variables
already in M. This is the case in going from M_RT to M′_RT: the value of SH is determined by
that of ST, whereas the value of BH is determined by that of SH and BT. However, this is
not the case in going from M′_RT to M″_RT. For example, nothing in M′_RT determines the value
of BA or SA; both are implicitly assumed to always be 1 in M′_RT. At the end of Section 4.4,
I discuss a context-relative generalization of the notion of conservative extension that can be
applied to M′_RT and M″_RT. For now, I continue with the notion of conservative extension as
defined above. It is quite widely applicable. In particular, in all the examples in Section 4.1, 
the model obtained by adding extra variables is a conservative extension of the original model. 
This can be proved by using arguments similar to those used in Proposition 4.2.3; I leave 
details to the reader. 


4.3 Using the Original HP Definition Instead of the Updated Definition


Recall that Example 2.8.1 was the one that originally motivated using the updated rather than 
the original HP definition. In this example, a prisoner dies either if A loads B’s gun and B
shoots, or if C loads and shoots his gun. In the actual context u, A loads B’s gun, B does not
shoot, but C does load and shoot his gun, so that the prisoner dies. The original HP definition,
using AC2(b^o), would declare A = 1 a cause of D = 1 (the prisoner dies); the updated
definition, using AC2(b^u), would not, provided that only the variables A, B, C, and D are
used. But, as I showed, by adding an additional variable, the original HP definition gives 
the desired result. The following result, proved in Section 4.8.2, shows that this approach 
generalizes: we can always use the original HP definition rather than the updated HP definition 
by adding extra variables. 


Theorem 4.3.1 If X = x is not a cause of Y = y in (M, u⃗) according to the updated HP
definition but is a cause according to the original HP definition, then there is a model M′
that is a conservative extension of M such that X = x is not a cause of Y = y in (M′, u⃗)
according to either the original or updated HP definitions.


Recall that parts (c) and (d) of Theorem 2.2.3 show that if X = x is not a cause of φ
according to the original HP definition, then it is not part of a cause according to the updated
HP definition. Example 2.8.1 shows that the converse is not true in general. Theorem 4.3.1
suggests that, by adding extra variables appropriately, we can construct a model where X = x
is a cause of φ according to the original HP definition iff it is a cause according to the
updated HP definition. As I mentioned earlier, this gives the original HP definition some 
technical advantages. With the original HP definition, causes are always single conjuncts; as 
Example 2.8.2 shows, this is not in general the case with the updated definition. (Many other 
examples show that it is not the case with the modified HP definition either.) Moreover, as I 
discuss in Section 5.3, testing for causality is harder for the updated HP definition (and the 
modified HP definition) than it is for the original HP definition. 

But adding extra variables so as to avoid the use of AC2(b^u) may result in a rather “unnatural”
model (of course, this presumes that we can agree on what “natural” means!). Moreover,
as I mentioned earlier, in some cases it does not seem appropriate to add these variables (see
Chapter 8). More experience is needed to determine which of AC2(b^u) and AC2(b^o) is most
appropriate or whether (as I now suspect) using the modified HP definition is the right thing 
to do. Fortunately, in many cases, the causality judgment is independent of which we use. 


4.4 The Stability of (Non-)Causality 


The examples in Section 4.1 raise a potential concern. Consider the rock-throwing example 
again. Adding extra variables changed BT = 1 from being a cause of BS = 1 to not being
a cause. Could adding even more variables convert BT = 1 back to being a cause? Could it 
then alternate further? 

Indeed it can, at least in the case of the original and updated HP definition. In general, we 
can convert an event from being a cause to a non-cause and then back again infinitely often 
by adding variables. We have essentially already seen this in the bogus prevention problem. 
Suppose that we just start with Bodyguard, who puts the antidote in the victim’s coffee. To 
start with, there is no assassin, so surely Bodyguard putting in the antidote is not a cause of 
the victim surviving. Now we introduce an assassin. Since there will be a number of potential 
assassins, call this first assassin Assassin #1. Assassin #1 is not a serious assassin. Although 
in principle he could put in poison, and has toyed with the idea, in fact, he does not. As we 
have seen, just the fact that he could put in poison is enough to make Bodyguard putting in the 
antidote a cause of Victim surviving according to the original and updated HP definition and 
part of a cause according to the modified definition. But, as we also saw in Section 3.4, once 
we add a variable $PN_1$ that talks about whether the poison was neutralized by the antidote,
Bodyguard is no longer a cause. 

Now we can repeat this process. Suppose that we discover that there is a second assassin, 
Assassin #2. Now Bodyguard putting in the antidote becomes (part of) a cause again; add 
$PN_2$, and it is no longer a cause. This can go on arbitrarily often.

I now formalize this. Specifically, I construct a sequence $M_0, M_1, M_2, \ldots$ of causal models
and a context u such that $M_{n+1}$ is a conservative extension of $M_n$ for all $n \ge 0$, B = 1 is not
part of a cause of VS = 1 in the even-numbered models in context u, and B = 1 is part
of a cause of VS = 1 in the odd-numbered models in context u. That is, the answer to the
question “Is B = 1 part of a cause of VS = 1?” alternates as we go along the sequence of 
models. 

$M_0$ is just the model with two binary endogenous variables B and VS and one binary
exogenous variable U. The values of both B and VS are determined by the context: in the
context $u_j$ where U = j, B = VS = j, for $j \in \{0, 1\}$. The models $M_1, M_2, M_3, \ldots$ are
defined inductively. For $n \ge 0$, we get $M_{2n+1}$ from $M_{2n}$ by adding a new variable $A_{n+1}$;
we get $M_{2n+2}$ from $M_{2n+1}$ by adding a new variable $PN_{n+1}$. Thus, the model $M_{2n+1}$ has
the endogenous variables $B, VS, A_1, \ldots, A_{n+1}, PN_1, \ldots, PN_n$; the model $M_{2n+2}$ has the
endogenous variables $B, VS, A_1, \ldots, A_{n+1}, PN_1, \ldots, PN_{n+1}$. All these models have just
one binary exogenous variable U. The exogenous variable determines the value of B and
$A_1, \ldots, A_{n+1}$ in models $M_{2n+1}$ and $M_{2n+2}$; in the context $u_1$, these variables all have value
1. In the context $u_0$, these variables all have value 0; in addition, the equations are such that
in $u_0$, VS = 0 no matter how the other variables are set. Poor Victim has a terminal illness in
$u_0$, so does not survive no matter what else happens. The equation for $PN_j$ in all the models
where it is a variable (namely, $M_{2j}, M_{2j+1}, \ldots$) is just the analogue of the equation for PN
in the original bogus prevention problem in Section 3.4.1:


$PN_j = \neg A_j \land B$ (i.e., $PN_j = (1 - A_j) \times B$).


Assassin #j’s poison is neutralized if he actually puts poison in the coffee (recall $A_j = 0$
means that assassin #j puts poison in the coffee) and B puts in antidote. Finally, for VS, the
equation depends on the model. In all cases, in $u_0$, VS = 0; in $u_1$,


$VS = (A_1 \lor PN_1) \land \ldots \land (A_n \lor PN_n) \land (A_{n+1} \lor B)$ in $M_{2n+1}$;


$VS = (A_1 \lor PN_1) \land \ldots \land (A_n \lor PN_n) \land (A_{n+1} \lor PN_{n+1})$ in $M_{2n+2}$.


Note that in $M_1$, $VS = A_1 \lor B$, and in $M_2$, $VS = A_1 \lor PN_1$, so $M_1$ and $M_2$ are essentially
the two models that were considered in Section 3.4.1.

The following theorem, whose proof can be found in Section 4.8.3, summarizes the situation
with regard to the models $M_0, M_1, M_2, \ldots$ defined above.


Theorem 4.4.1 For all $n \ge 0$, $M_{n+1}$ is a conservative extension of $M_n$. Moreover, B = 1
is not part of a cause of VS = 1 in $(M_{2n}, u_1)$, and B = 1 is part of a cause of VS = 1 in
$(M_{2n+1}, u_1)$, for $n = 0, 1, 2, \ldots$ (according to all variants of the HP definition).
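
The alternation can be checked by brute force on the first few models. The following is only a sketch under stated assumptions: it tests the modified HP definition alone (AC1, AC2(a$^m$), AC3), treats every variable as binary, and takes $\varphi$ to be "VS has its actual value". The names `make_model`, `solve`, `ac2`, `is_cause`, and `part_of_cause` are illustrative helpers, not notation from the text.

```python
from itertools import combinations, product

def make_model(n):
    """Equations of M_n as a dict whose insertion order is a topological order."""
    n_a, n_pn = (n + 1) // 2, n // 2            # how many A_j and PN_j variables M_n has
    eqs = {"B": lambda v, u: u}
    for j in range(1, n_a + 1):
        eqs[f"A{j}"] = lambda v, u: u            # A_j = 0 means assassin #j puts in poison
    for j in range(1, n_pn + 1):
        eqs[f"PN{j}"] = lambda v, u, j=j: (1 - v[f"A{j}"]) * v["B"]

    def vs(v, u):
        ok = u                                   # in u_0 the victim dies regardless
        for j in range(1, n_pn + 1):
            ok &= v[f"A{j}"] | v[f"PN{j}"]
        if n_a > n_pn:                           # odd-numbered model: last clause uses B
            ok &= v[f"A{n_a}"] | v["B"]
        return ok

    eqs["VS"] = vs
    return eqs

def solve(eqs, u, do=None):
    """Values of all endogenous variables in context u under the intervention `do`."""
    do, v = do or {}, {}
    for name, f in eqs.items():
        v[name] = do[name] if name in do else f(v, u)
    return v

def ac2(eqs, u, xs, target):
    """AC2(a^m): some setting of xs, with some W frozen at actual values, flips target."""
    actual = solve(eqs, u)
    rest = [name for name in eqs if name not in xs]
    for alt in product((0, 1), repeat=len(xs)):
        for k in range(len(rest) + 1):
            for ws in combinations(rest, k):
                do = dict(zip(xs, alt))
                do.update({w: actual[w] for w in ws})
                if solve(eqs, u, do)[target] != actual[target]:
                    return True
    return False

def is_cause(eqs, u, xs, target):
    """xs (at its actual values) satisfies AC2(a^m) and no strict subset does (AC3)."""
    if not ac2(eqs, u, xs, target):
        return False
    return not any(ac2(eqs, u, sub, target)
                   for k in range(1, len(xs)) for sub in combinations(xs, k))

def part_of_cause(eqs, u, var, target):
    names = list(eqs)
    return any(var in xs and is_cause(eqs, u, xs, target)
               for k in range(1, len(names) + 1) for xs in combinations(names, k))

# Is B = 1 part of a cause of VS = 1 in (M_n, u_1) for n = 0, ..., 4?
print([part_of_cause(make_model(n), 1, "B", "VS") for n in range(5)])
# expected: [False, True, False, True, False]
```

The search is exponential in the number of variables, so it is only practical for the first handful of models; it illustrates the statement of the theorem and is no substitute for the proof.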


Theorem 4.4.1 is somewhat disconcerting. It seems that looking more and more carefully at
a situation should not result in our view of X = x being a cause of Y = y alternating between
“yes” and “no”, at least not if we do not discover anything inconsistent with our understanding
of the relations between previously known variables. Yet, Theorem 4.4.1 shows that this can
happen. Moreover, the construction used in Theorem 4.4.1 can be applied to any model M
such that $(M, \vec{u}) \models B = 1 \land C = 1$, but B and C are independent of each other (so that,
in particular, B = 1 is not a cause of C = 1), to get a sequence of models $M_0, M_1, \ldots$,
with $M = M_0$ and $M_{n+1}$ a conservative extension of $M_n$ such that the truth of the statement
“B = 1 is part of a cause of C = 1 in $(M_n, \vec{u})$” alternates as we go along the sequence.

While disconcerting, I do not believe that, in fact, this is a serious problem. First, it is worth 
observing that while Theorem 4.4.1 applies to the original and updated HP definition even for 
(full) causes (this is immediate from the proof), with the modified definition, the fact that we 
are considering parts of causes rather than (full) causes is critical. Indeed, the following result 
is almost immediate from the definition of conservative extension. 


Theorem 4.4.2 If $M'$ is a conservative extension of $M$, and $\vec{X} = \vec{x}$ is a cause of $\varphi$ in $(M, \vec{u})$
according to the modified HP definition, then there exists a (not necessarily strict) subset $\vec{X}_1$
of $\vec{X}$ such that $\vec{X}_1 = \vec{x}_1$ is a cause of $\varphi$ in $(M', \vec{u})$, where $\vec{x}_1$ is the restriction of $\vec{x}$ to the
variables in $\vec{X}_1$.


Proof: Suppose that $\vec{X} = \vec{x}$ is a cause of $\varphi$ in $(M, \vec{u})$ according to the modified HP definition.
Then (a) $(M, \vec{u}) \models \vec{X} = \vec{x} \land \varphi$ and (b) there exists a set $\vec{W}$ of endogenous variables, a
setting $\vec{x}'$ of the variables in $\vec{X}$, and a setting $\vec{w}^*$ of the variables in $\vec{W}$ such that $(M, \vec{u}) \models \vec{W} = \vec{w}^*$
and $(M, \vec{u}) \models [\vec{X} \leftarrow \vec{x}', \vec{W} \leftarrow \vec{w}^*]\neg\varphi$. (Part (b) is just the statement of AC2(a$^m$).) Since $M'$
is a conservative extension of $M$, (a) and (b) hold with $M$ replaced by $M'$. So the only way
that $\vec{X} = \vec{x}$ can fail to be a cause of $\varphi$ in $(M', \vec{u})$ is if it fails AC3, which means that there
must be a strict subset $\vec{X}_1$ of $\vec{X}$ such that $\vec{X}_1 = \vec{x}_1$ is a cause of $\varphi$. ∎


Thus, for the modified HP definition, for causes involving a single variable, causality is 
stable. For causes involving several conjuncts, we can get some instability, but not too much. 
It follows from Theorem 4.4.2 that we can get at most one alternation from non-causality to 
causality to non-causality. Once $\vec{X} = \vec{x}$ goes from being a cause of $\varphi$ in $(M, \vec{u})$ to being a
non-cause in $(M', \vec{u})$, it cannot go back to being a cause again in $(M'', \vec{u})$ for a conservative
extension $M''$ of $M'$ (since then $\vec{X}_1 = \vec{x}_1$ is a cause of $\varphi$ in $(M', \vec{u})$ for some strict subset
$\vec{X}_1$ of $\vec{X}$, so there must be a (not necessarily strict) subset $\vec{X}_2$ of $\vec{X}_1$ such that $\vec{X}_2 = \vec{x}_2$ is a
cause of $\varphi$ in $(M'', \vec{u})$. Thus, by AC3, $\vec{X} = \vec{x}$ cannot be a cause of $\varphi$ in $(M'', \vec{u})$.)

The rock-throwing example shows that we can have a little alternation. Suzy’s throw 
(ST = 1) is not a cause of the bottle shattering in the naive rock-throwing model $M_{RT}$
in the context where both Billy and Suzy throw; rather, according to the modified HP definition,
$ST = 1 \land BT = 1$ is a cause, so Suzy’s throw is only part of a cause. In contrast,
Suzy’s throw is a cause in the more sophisticated rock-throwing model. But it follows from 
Theorem 4.4.2 that Suzy’s throw will continue to be a cause of the bottle shattering in any 
conservative extension of the sophisticated rock-throwing model according to the modified 
HP definition. More generally, arguably, stability is not a problem for full causality using the 
modified HP definition. 
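
The contrast between the two rock-throwing models can be reproduced with a small brute-force check of the modified HP definition. This is a sketch under assumptions: the equations used for the sophisticated model ($SH = ST$, $BH = BT \land \neg SH$, $BS = SH \lor BH$) are the standard ones for this example and are taken as given here, the context is fixed to the one where both throw, and the helper names are illustrative only.

```python
from itertools import combinations, product

def solve(eqs, do=None):
    """Evaluate the equations in dict (topological) order under the intervention `do`."""
    do, v = do or {}, {}
    for name, f in eqs.items():
        v[name] = do[name] if name in do else f(v)
    return v

def ac2(eqs, xs, target):
    """AC2(a^m): changing xs, with some other variables frozen at actual values, flips target."""
    actual = solve(eqs)
    rest = [name for name in eqs if name not in xs]
    for alt in product((0, 1), repeat=len(xs)):
        for k in range(len(rest) + 1):
            for ws in combinations(rest, k):
                do = dict(zip(xs, alt))
                do.update({w: actual[w] for w in ws})
                if solve(eqs, do)[target] != actual[target]:
                    return True
    return False

def is_cause(eqs, xs, target):
    """AC2(a^m) holds for xs and for no strict nonempty subset of xs (AC3)."""
    return ac2(eqs, xs, target) and not any(
        ac2(eqs, sub, target)
        for k in range(1, len(xs)) for sub in combinations(xs, k))

# Context in which both Suzy and Billy throw.
naive = {
    "ST": lambda v: 1,
    "BT": lambda v: 1,
    "BS": lambda v: v["ST"] | v["BT"],
}
sophisticated = {
    "ST": lambda v: 1,
    "BT": lambda v: 1,
    "SH": lambda v: v["ST"],
    "BH": lambda v: v["BT"] & (1 - v["SH"]),   # Billy hits only if Suzy does not
    "BS": lambda v: v["SH"] | v["BH"],
}

print(is_cause(naive, ("ST",), "BS"))           # False: ST = 1 alone is not a cause
print(is_cause(naive, ("ST", "BT"), "BS"))      # True: the conjunction is a cause
print(is_cause(sophisticated, ("ST",), "BS"))   # True: freeze BH = 0, set ST = 0
```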


But even according to the original and updated HP definition, and for parts of causes with 
the modified HP definition, I don’t think that the situation is so dire. A child may start with a 
primitive understanding of how the world works and believe that just throwing a rock causes 
a bottle to shatter. Later he may become aware of the importance of the rock actually hitting 
the bottle. Still later, he may become aware of other features critical to bottles shattering. 
This increased awareness can and should result in causality ascriptions changing. However, 
in practice, there are very few new features that should matter. We can make this precise 
by observing that most new features that we become aware of are almost surely irrelevant to 
the bottle shattering, except perhaps in highly abnormal circumstances. If the new variables 
were relevant, we probably would have become aware of them sooner. The fact that we don’t 
become aware of them suggests that we need to use an abnormal setting of these variables to 
uncover the causality. 

As I now show, once we take normality into account, under reasonable assumptions, non- 
causality is stable. To make this precise, I must first extend the notion of conservative ex- 
tension to extended causal models so as to take the normality ordering into account. For 
definiteness, I use the definitions in Section 3.2, where normality is defined on worlds, rather 
than the alternative definition in Section 3.5, although essentially the same arguments work 
in both cases. I do make some brief comments throughout about the modifications needed to 
deal with the alternative definition. 


Definition 4.4.3 An extended causal model $M' = (\mathcal{S}', \mathcal{F}', \succeq')$ is a conservative extension
of an extended causal model $M = (\mathcal{S}, \mathcal{F}, \succeq)$ if the causal model $(\mathcal{S}', \mathcal{F}')$ underlying $M'$ is a
conservative extension of the causal model $(\mathcal{S}, \mathcal{F})$ underlying $M$ according to Definition 4.2.1
and, in addition, the following condition holds, where $\mathcal{V}$ is the set of endogenous variables in
$M$:


CE. For all contexts $\vec{u}$, if $\vec{W} \subseteq \mathcal{V}$, then $s_{\vec{W}=\vec{w},\vec{u}} \succeq s_{\vec{u}}$ iff $s_{\vec{W}=\vec{w},\vec{u}} \succeq' s_{\vec{u}}$. ∎


Roughly speaking, CE says that the normality ordering when restricted to worlds characterized
by settings of the variables in $\mathcal{V}$ is the same in $M$ and $M'$. (Actually, CE says less than
this. I could have taken a stronger version of CE that would be closer to this English gloss:
if $\vec{W} \cup \vec{W}' \subseteq \mathcal{V}$, then $s_{\vec{W}=\vec{w},\vec{u}} \succeq s_{\vec{W}'=\vec{w}',\vec{u}}$ iff $s_{\vec{W}=\vec{w},\vec{u}} \succeq' s_{\vec{W}'=\vec{w}',\vec{u}}$. The version of CE
that I consider suffices to prove the results below, but this stronger version seems reasonable
as well.) With the alternative definition, where the normality ordering is on contexts, the
analogue of CE is even simpler: it just requires that the normality ordering in $M$ and $M'$ is
identical (since the set of contexts is the same in both models).

For the remainder of this section, I work with extended causal models $M$ and $M'$ and use
the definition of causality that takes normality into account (i.e., I use AC2$^+$(a) for the original
and updated definition and use AC2$^+$(a$^m$) for the modified definition), although, for ease of
exposition, I do not mention this explicitly. As above, I take $\succeq$ and $\succeq'$ to be the preorders in
$M$ and $M'$, respectively.

Note that Theorem 4.4.2 applies to the modified HP definition even for extended causal 
models. Thus, again, stability is not a serious problem for the modified HP definition in 
extended causal models. I now provide a condition that almost ensures that non-causality is 
stable for all the versions of the HP definition. Roughly speaking, I want it to be abnormal for 
a variable to take on a value other than that specified by the equations. Formally, say that in 
world s, V takes on a value other than that specified by the equations in $(M, \vec{u})$ if, taking $\vec{W}^*$
to consist of all endogenous variables in $M$ other than V, if $\vec{w}^*$ gives the values of the variables
in $\vec{W}^*$ in s, and v is the value of V in s, then $(M, \vec{u}) \models [\vec{W}^* \leftarrow \vec{w}^*](V \ne v)$. For future
reference, note that it is easy to check that if $\vec{W} \subseteq \vec{W}^*$ and $(M, \vec{u}) \models [\vec{W} \leftarrow \vec{w}](V \ne v)$,
then V takes on a value other than that specified by the equations in $s_{\vec{W}=\vec{w},V=v,\vec{u}}$. The normality
ordering in $M$ respects the equations for V relative to $\vec{u}$ if, for all worlds s such that V takes
on a value in s other than that specified by the equations in $(M, \vec{u})$, we have $s \not\succeq s_{\vec{u}}$ (where
$\succeq$ is the preorder on worlds in $M$). Using the alternative definition of normality, there is an
analogous condition: if $\vec{u}'$ is a context where V takes on a value different from that specified
by the equations in $(M, \vec{u})$, then we have $\vec{u}' \not\succeq \vec{u}$. Since the normality ordering on contexts is
the same in $M$ and $M'$, this means that unless $\vec{u}' \not\succeq \vec{u}$, the value that V takes on in context $\vec{u}'$
in $M'$ must be that specified by the equations in $(M, \vec{u})$.
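
The test for whether a variable "takes on a value other than that specified by the equations" in a world is mechanical: hold every other endogenous variable at its value in the world and compare. The sketch below applies it to the bogus-prevention model $M_1$ (with $VS = A_1 \lor B$ in $u_1$), assuming binary variables; `solve` and `off_equation` are hypothetical helper names, not notation from the text.

```python
def solve(eqs, u, do=None):
    """Values of all endogenous variables in context u under the intervention `do`."""
    do, v = do or {}, {}
    for name, f in eqs.items():
        v[name] = do[name] if name in do else f(v, u)
    return v

def off_equation(eqs, u, world, var):
    """True if `var` has a value in `world` other than the one its equation gives
    when all other endogenous variables are set to their values in `world`."""
    others = {name: val for name, val in world.items() if name != var}
    return solve(eqs, u, others)[var] != world[var]

# M_1: B and A1 copy the context; VS = U and (A1 or B).
m1 = {
    "B": lambda v, u: u,
    "A1": lambda v, u: u,
    "VS": lambda v, u: u & (v["A1"] | v["B"]),
}

actual = solve(m1, 1)                          # the world s_{u_1}
witness = solve(m1, 1, {"A1": 0, "B": 0})      # world used to show B = 1 is part of a cause

print(off_equation(m1, 1, actual, "A1"))       # False
print(off_equation(m1, 1, witness, "A1"))      # True: A1 = 0 although its equation gives 1
print(off_equation(m1, 1, witness, "VS"))      # False: VS follows its equation given A1, B
```

If the normality ordering respects the equations for $A_1$ relative to $u_1$, the second result is exactly what rules the witness out: it must fail to be at least as normal as the actual world.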

To show that B = 1 is a cause of VS = 1 in $(M_{2n+1}, u_1)$, we consider a witness s where
$A_{n+1} = 0$. If we assume that the normality ordering in $M_{2n+1}$ respects the equations for
$A_{n+1}$ relative to $u_1$, then s is less normal than $s_{u_1}$, so cannot be used to satisfy AC2$^+$(a) or
AC2$^+$(a$^m$) when we are considering causality in $(M_{2n+1}, u_1)$. As a consequence, it follows that
B = 1 is not a cause of VS = 1 in $M_{2n+1}$ under this assumption. More generally, as I now
show, if the normality ordering respects the equations for all the variables added in going from
$M$ to a conservative extension $M'$ relative to $\vec{u}$, then causality in $(M, \vec{u})$ is stable. Exactly
how stable it is depends in part on the variant of the HP definition considered. The next 
theorem gives the key result.


Theorem 4.4.4 Suppose that $M$ and $M'$ are extended causal models such that $M'$ is a
conservative extension of $M$, the normality ordering in $M'$ respects the equations for all
variables in $\mathcal{V}' - \mathcal{V}$ relative to $\vec{u}$, where $\mathcal{V}$ and $\mathcal{V}'$ are the sets of endogenous variables in $M$
and $M'$, respectively, and all the variables in $\varphi$ are in $\mathcal{V}$. Then the following holds:


(a) According to the original and updated HP definition, if $\vec{X} = \vec{x}$ is not a cause of $\varphi$ in
$(M, \vec{u})$, then either $\vec{X} = \vec{x}$ is not a cause of $\varphi$ in $(M', \vec{u})$ or there is a strict subset $\vec{X}_1$
of $\vec{X}$ such that $\vec{X}_1 = \vec{x}_1$ is a cause of $\varphi$ in $(M, \vec{u})$, where $\vec{x}_1$ is the restriction of $\vec{x}$ to
the variables in $\vec{X}_1$.


(b) According to the modified HP definition, $\vec{X} = \vec{x}$ is a cause of $\varphi$ in $(M, \vec{u})$ iff $\vec{X} = \vec{x}$
is a cause of $\varphi$ in $(M', \vec{u})$ (so, in particular, a cause of $\varphi$ in $(M', \vec{u})$ cannot involve a
variable in $\mathcal{V}' - \mathcal{V}$).


It is immediate from part (b) that, under the assumptions of Theorem 4.4.4, both causality 
and non-causality are stable for the modified HP definition. It is also immediate from part 
(a) that, for the original and updated HP definition, non-causality is stable for causes that are 
single conjuncts. I believe that for the original HP definition, non-causality is stable even for 
causes that are not single conjuncts (recall that with the original HP definition, there can be 
causes that are not single conjuncts once we take normality into account; see Example 3.2.3), 
although I have not yet proved this. However, for the updated HP definition, non-causality 
may not be stable for causes that are not single conjuncts (see Example 4.8.2 below). But
we cannot get too much alternation. Given the appropriate assumptions about the normality
ordering, the following result shows that $\vec{X} = \vec{x}$ cannot go from being a cause of $\varphi$ to not
being a cause back to being a cause again according to the original and updated HP definition;
if we have three models $M$, $M'$, and $M''$, where $M'$ is a conservative extension of $M$, $M''$
is a conservative extension of $M'$, and $\vec{X} = \vec{x}$ is a cause of $\varphi$ in $(M, \vec{u})$ and $(M'', \vec{u})$, then
$\vec{X} = \vec{x}$ must also be a cause of $\varphi$ in $(M', \vec{u})$.


Theorem 4.4.5 According to the original and updated HP definition, if (a) $M'$ is a conservative
extension of $M$, (b) $M''$ is a conservative extension of $M'$, (c) $\vec{X} = \vec{x}$ is a cause of $\varphi$ in
$(M, \vec{u})$ and $(M'', \vec{u})$, (d) the normality ordering in $M'$ respects the equations for all endogenous
variables in $M'$ not in $M$ relative to $\vec{u}$, and (e) the normality ordering in $M''$ respects
the equations for all endogenous variables in $M''$ not in $M'$ relative to $\vec{u}$, then $\vec{X} = \vec{x}$ is also
a cause of $\varphi$ in $(M', \vec{u})$.


Although these results show that we get stability of causality, it comes at a price, at least in 
the case of the original and updated HP definition: the assumption that the normality ordering 
respects the equations for a variable relative to a context $\vec{u}$ is clearly quite strong. For one
thing, for the modified HP definition, it says that there are no “new” causes of $\varphi$ in $(M', \vec{u})$
that involve variables that were added in going from $M$ to $M'$. Very roughly speaking, the
reason is the following: Because the normality ordering respects the equations in $\vec{u}$, not only
are all the new variables that are added determined by the original variables, under normal
conditions, in context $\vec{u}$, the new variables have no impact on the original variables beyond
that already determined in $(M, \vec{u})$.

Although it may seem reasonable to require that it be abnormal for the new variables not to
respect the equations in $\vec{u}$, recall the discussion after Example 3.2.6: the normality ordering is
placed on worlds, which are complete assignments to the endogenous variables, not on complete
assignments to both endogenous and exogenous variables. Put another way, in general,
the normality ordering does not take the context into account. So saying that the normality
ordering respects the equations for a variable V relative to $\vec{u}$ is really saying that, as far as
V is concerned, what happens in $\vec{u}$ is really the normal situation. Obviously, how reasonable
this assumption is depends on the example.

To see that it is not completely unreasonable, consider the assassin example used to prove 
Theorem 4.4.1. It might be better to think of the variable $A_n$ in this example as being
three-valued: $A_n = 0$ if assassin #n is present and puts in poison, $A_n = 1$ if assassin #n is
present and does not put in poison, and $A_n = 2$ if assassin #n is not present. Clearly the
normal value is $A_n = 2$. Take u to be the context where, in model $M_{2n+1}$, $A_n = 2$. While
the potential presence of a number of assassins makes bodyguard putting in antidote (part of)
a cause in $(M_{2n+1}, u)$, it is no longer part of a cause once we take normality into account.
Moreover, here it does seem reasonable to say that violating the equations for $A_n$ relative to
u is abnormal. 

These observations suggest why, in general, although the assumption that the normality 
ordering respects the equations for the variables in $\mathcal{V}' - \mathcal{V}$ relative to the context $\vec{u}$ is a strong
one, it may not be so unreasonable in practice. Typically, the variables that we do not mention 
take on their expected values, and thus are not even noticed. 

Theorem 4.4.4 is the best we can do. In Section 4.8.3, I give an example showing that it is 
possible for $\vec{X} = \vec{x}$ to go from being a non-cause of $\varphi$ to being a cause to being a non-cause
again according to the original and updated definition, even taking normality into account. 
(By Theorem 4.4.5, it must then stay a non-cause.) 

There is one other issue to consider: can X = x alternate from being part of a cause of
$\varphi$ infinitely often, even if we take normality considerations into account? In the case of the
modified HP definition, it follows easily from Theorem 4.4.4 that the answer is no. 


Theorem 4.4.6 Suppose that $M$ and $M'$ are extended causal models such that $M'$ is a
conservative extension of $M$, and the normality ordering in $M'$ respects the equations for all
variables in $\mathcal{V}' - \mathcal{V}$ relative to $\vec{u}$, where $\mathcal{V}$ and $\mathcal{V}'$ are the sets of endogenous variables in
$M$ and $M'$, respectively. Then X = x is part of a cause of $\varphi$ in $(M, \vec{u})$ according to the
modified HP definition iff X = x is part of a cause of $\varphi$ in $(M', \vec{u})$ according to the modified
HP definition.


Proof: If X = x is part of a cause $\vec{X} = \vec{x}$ of $\varphi$ in $(M, \vec{u})$ according to the modified HP
definition, then by Theorem 4.4.4(b), $\vec{X} = \vec{x}$ is a cause of $\varphi$ in $(M', \vec{u})$, and hence X = x is
part of a cause in $(M', \vec{u})$. Conversely, if X = x is part of a cause $\vec{X} = \vec{x}$ in $(M', \vec{u})$, then by
Theorem 4.4.4(b) again, all the variables in $\vec{X}$ are in $\mathcal{V}$, and $\vec{X} = \vec{x}$ is a cause of $\varphi$ in $(M, \vec{u})$,
so X = x is part of a cause in $(M, \vec{u})$. ∎


I conjecture that, under the assumptions of Theorem 4.4.6, if X = x is not part of a cause
in $(M, \vec{u})$ according to the updated HP definition, then it is not part of a cause according to
the original or updated HP definition in $(M', \vec{u})$ either, but I have not been able to prove this.

One final point is worth making. Recall that the model $M^*_{RT}$ is not a conservative extension
of $M_{RT}$ and $M'_{RT}$. The addition of the variables SA, BA, SF, and BF makes the contexts
in the two models fundamentally different. Yet, in the context $u^*$ in $M^*_{RT}$ where the same
assumptions are made as in $M_{RT}$ and $M'_{RT}$, namely, that Suzy and Billy both throw, are both
accurate, and Suzy hits first, we have the same causality ascriptions. This is not a fluke. In
Definition 4.2.1, conservative extension is defined as a relation between two models. We can
also define what it means for one causal setting $(M', \vec{u}')$ to be a conservative extension of
another causal setting $(M, \vec{u})$. The set of endogenous variables in $M$ is still required to be
a subset of that in $M'$, but now we can drop the requirement that $M$ and $M'$ use the same
exogenous variables.


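Definition 4.4.7 Given causal models $M' = ((\mathcal{U}', \mathcal{V}', \mathcal{R}'), \mathcal{F}')$ and $M = ((\mathcal{U}, \mathcal{V}, \mathcal{R}), \mathcal{F})$,
the setting $(M', \vec{u}')$ is a conservative extension of $(M, \vec{u})$ if $\mathcal{V} \subseteq \mathcal{V}'$, and for all contexts
$\vec{u}$, all variables $X \in \mathcal{V}$, and all settings $\vec{w}$ of the variables in $\vec{W} = \mathcal{V} - \{X\}$, we have
$(M, \vec{u}) \models [\vec{W} \leftarrow \vec{w}](X = x)$ iff $(M', \vec{u}') \models [\vec{W} \leftarrow \vec{w}](X = x)$. ∎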
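
For small models, the condition in this definition can be checked exhaustively for one fixed pair of settings. The sketch below assumes binary variables and uses the bogus-prevention models $M_1$ and $M_2$ from earlier in the section; `m2_broken` is a deliberately wrong variant introduced only for contrast, and all function names are hypothetical.

```python
from itertools import product

def solve(eqs, u, do=None):
    """Values of all endogenous variables in context u under the intervention `do`."""
    do, v = do or {}, {}
    for name, f in eqs.items():
        v[name] = do[name] if name in do else f(v, u)
    return v

def setting_extends(big, u_big, small, u_small):
    """Check that (big, u_big) is a conservative extension of (small, u_small):
    for each X in the small model and each setting of the other small-model
    variables, X takes the same value in both settings."""
    names = list(small)
    if not set(names) <= set(big):
        return False
    for x in names:
        ws = [n for n in names if n != x]
        for vals in product((0, 1), repeat=len(ws)):
            do = dict(zip(ws, vals))
            if solve(small, u_small, do)[x] != solve(big, u_big, do)[x]:
                return False
    return True

m1 = {
    "B": lambda v, u: u,
    "A1": lambda v, u: u,
    "VS": lambda v, u: u & (v["A1"] | v["B"]),
}
m2 = {
    "B": lambda v, u: u,
    "A1": lambda v, u: u,
    "PN1": lambda v, u: (1 - v["A1"]) * v["B"],
    "VS": lambda v, u: u & (v["A1"] | v["PN1"]),
}
# Hypothetical variant in which the antidote never neutralizes anything.
m2_broken = dict(m2, PN1=lambda v, u: 0)

print(setting_extends(m2, 1, m1, 1))          # True
print(setting_extends(m2_broken, 1, m1, 1))   # False: differs when A1 = 0 and B = 1
```

Because the two contexts are passed separately, the same check applies when the two models do not share exogenous variables, which is the point of moving from models to settings.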


It is easy to see that, although $M^*_{RT}$ is not a conservative extension of $M_{RT}$ or $M'_{RT}$,
$(M^*_{RT}, u^*)$ is a conservative extension of both $(M_{RT}, u)$ and $(M'_{RT}, u)$.

We can further extend this definition to extended causal models by requiring that the condition
CE holds when restricted to the two relevant settings. That is, if $\vec{W} \subseteq \mathcal{V}$, we require
that $s_{\vec{W}=\vec{w},\vec{u}} \succeq s_{\vec{u}}$ iff $s_{\vec{W}=\vec{w},\vec{u}'} \succeq' s_{\vec{u}'}$. For the alternative definition, we can no longer require
that the normality ordering on contexts be identical; the set of contexts is different in the two
models. Rather, I require that the normality ordering works the same way on sets definable
by formulas involving only variables in $\mathcal{V}$. Formally, I require that for all Boolean combinations
$\varphi$ and $\varphi'$ of primitive events that involve only the endogenous variables in $\mathcal{V}$, we have
$[\![\varphi]\!] \succeq^e [\![\varphi']\!]$ iff $[\![\varphi]\!] (\succeq')^e [\![\varphi']\!]$.

All the results proved in this section on conservative extensions hold without change if we 
consider conservative extensions of causal settings, with just minimal changes to the proof. 
(I considered conservative extensions of causal models here only because the results are a 
little easier to state.) This helps explain why $(M'_{RT}, u)$ and $(M^*_{RT}, u^*)$ are the same as far as
causality ascriptions go. 


4.5 The Range of Variables 


As I said earlier, the set of possible values of a variable must also be chosen carefully. The 
appropriate set of values of a variable will depend on the other variables in the picture and 
the relationship between them. Suppose, for example, that a hapless homeowner comes home 
from a trip to find that his front door is stuck. If he pushes on the door with normal force, 
then it will not open. However, if he leans his shoulder against it and gives a solid push, then 
the door will open. To model this, it suffices to have a variable O with values either 0 or 
1, depending on whether the door opens, and a variable P, with values 0 or 1 depending on
whether the homeowner gives a solid push. 

In contrast, suppose that the homeowner also forgot to disarm the security system and the 
system is very sensitive, so it will be tripped by any push on the door, regardless of whether
the door opens. Let A = 1 if the alarm goes off, A = 0 otherwise. Now if we try to model the
situation with the same variable P, we will not be able to express the dependence of the alarm
on the homeowner’s push. To deal with both O and A, we need to extend P to a 3-valued
variable $P'$, with values 0 if the homeowner does not push the door, 1 if he pushes it with
normal force, and 2 if he gives it a solid push.
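
One way to see why the binary P is too coarse is to ask whether each outcome is a function of the encoded push. The sketch below is illustrative only: the three physical situations are numbered 0 (no push), 1 (normal push), and 2 (solid push), and the function names are assumptions rather than notation from the text.

```python
SITUATIONS = (0, 1, 2)   # 0 = no push, 1 = normal push, 2 = solid push

def door_opens(s):       # O: the door opens only after a solid push
    return int(s == 2)

def alarm(s):            # A: the alarm is tripped by any push
    return int(s >= 1)

def binary_p(s):         # P: 1 iff the homeowner gives a solid push
    return int(s == 2)

def three_valued_p(s):   # P': distinguishes all three situations
    return s

def is_function_of(encoding, outcome):
    """True if `outcome` is determined by the encoded value in every situation."""
    seen = {}
    for s in SITUATIONS:
        if seen.setdefault(encoding(s), outcome(s)) != outcome(s):
            return False
    return True

print(is_function_of(binary_p, door_opens))      # True
print(is_function_of(binary_p, alarm))           # False: P = 0 covers both A = 0 and A = 1
print(is_function_of(three_valued_p, alarm))     # True
print(is_function_of(three_valued_p, door_opens))  # True
```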


Although in most cases, a reasonable choice for the range of the variables in a model is 
not hard to come by, there are some important independence principles that apply. That is the 
topic of the next section. 


4.6 Dependence and Independence 


The whole notion of intervention requires that we be able to set variables independently. This 
means that values of different variables should not correspond to logically related events. For 
suppose that we had a model with variables $H_1$ and $H_2$, where $H_1$ represents “Martha says
‘hello’” (i.e., $H_1 = 1$ if Martha says “hello” and $H_1 = 0$ otherwise), and $H_2$ represents
“Martha says ‘hello’ loudly”. The intervention $H_1 = 0 \land H_2 = 1$ is meaningless; it is
logically impossible for Martha not to say “hello” and to say “hello” loudly. 


For this reason, I was careful to say in the examples in Section 4.1 that the variables added 
to help in structuring the scenarios were independent variables, not ones defined in terms of 
the original variables. Thus, for example, the variable D that was added in Example 4.1.1, 
whose equation is given as $D = \neg S \land A$, can be thought of as a switch that turns on if S
makes the route from A to C active and A is on; in particular, it can be set to 1 even if, say, A 
is off. 


It is unlikely that a careful modeler would choose variables that have logically related 
values. However, the converse of this principle, that the different values of any particular 
variable should be logically related (in particular, mutually exclusive), is less obvious and 
equally important. Consider Example 2.3.3. Although, in the actual context, Billy’s rock will 
hit the bottle if Suzy’s doesn’t, this is not a necessary relationship. Suppose that, instead of 
using two variables SH and BH, we try to model the scenario with a variable H that takes the 
value 1 if Suzy’s rock hits and 0 if Billy’s rock hits. If BS = 1 no matter who hits the bottle
(i.e., no matter what the value of H), then it is not hard to verify that in this model, there is 
no contingency such that the bottle’s shattering depends on Suzy’s throw. The problem is that 
H = 0 and H = 1 are not mutually exclusive; there are possible situations in which both
rocks hit or neither rock hits the bottle. Using the language from earlier in this chapter, this 
model is insufficiently expressive; there are situations significant to the analysis that cannot 
be represented. In particular, in this model, we cannot consider independent interventions on 
the rocks hitting the bottle. As the discussion in Example 2.3.3 shows, it is precisely such an 
intervention that is needed to establish that Suzy’s throw (and not Billy’s) is the actual cause 
of the bottle shattering. 


Although these rules are simple in principle, their application is not always transparent. 
For example, they will have particular consequences for how we should represent events that 
might occur at different times. Consider the following example.


Example 4.6.1 Suppose that the Careless Camper (CC for short) has plans to go camping
on the first weekend in June. He will go camping unless there is a fire in the forest in May. If
he goes camping, he will leave a campfire unattended, and there will be a forest fire. Let the
variable C take the value 1 if CC goes camping and 0 otherwise. How should the state of the
forest be represented? 

There appear to be at least three alternatives. The simplest proposal would be to use a 
variable F that takes the value 1 if there is a forest fire at some time and 0 otherwise. But in
that case, how should the dependency relations between F and C be represented? Since CC
will go camping only if there is no fire (in May), we would want to have an equation such as
$C = \neg F$. However, since there will be a fire (in June) iff CC goes camping, we would also
need F = C. This representation is clearly not expressive enough, since it does not let us
make the clearly relevant distinction between whether the forest fire occurs in May or June. 
As a consequence, the model is not recursive, and the equations have no consistent solution. 
A second alternative would be to use a variable $F'$ that takes the value 0 if there is no fire, 1
if there is a fire in May, and 2 if there is a fire in June. But now how should the equations be
written? Since CC will go camping unless there is a fire in May, the equation for C should
say that C = 0 iff $F' = 1$. And since there will be a fire in June if CC goes camping, the
equation for $F'$ should say that $F' = 2$ if C = 1. These equations are highly misleading in
what they predict about the effects of interventions. For example, the first equation tells us 
that intervening to create a forest fire in June would cause CC to go camping in the beginning 
of June. But this seems to get the causal order backward! 

The third way to model the scenario is to use two separate variables, $F_1$ and $F_2$, to represent
the state of the forest at separate times. $F_1 = 1$ will represent a fire in May, and $F_1 = 0$
represents no fire in May; $F_2 = 1$ represents a fire in June and $F_2 = 0$ represents no fire in
June. Now we can write our equations as $C = 1 - F_1$ and $F_2 = C \times (1 - F_1)$. This
representation is free from the defects that plague the other two representations. We have no
cycles, and hence there will be a consistent solution for any value of the exogenous variables.
Moreover, this model correctly tells us that only an intervention on the state of the forest in
May will affect CC’s camping plans. ∎
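
The difference between the first and third representations can be checked directly. The sketch below makes one assumption beyond the text: in the third model, the value of $F_1$ is treated as given by the context (the parameter `f1`), since the example does not give an equation for it. The function name `camper` is illustrative.

```python
from itertools import product

# First representation: C = 1 - F together with F = C.
# Search all binary assignments for one that satisfies both equations.
cyclic_solutions = [(c, f) for c, f in product((0, 1), repeat=2)
                    if c == 1 - f and f == c]
print(cyclic_solutions)   # []: the cyclic equations have no consistent solution

# Third representation: C = 1 - F1 and F2 = C * (1 - F1).
def camper(f1, do=None):
    """Solve the acyclic model; `do` optionally overrides variables by intervention."""
    do = do or {}
    v = {"F1": do.get("F1", f1)}
    v["C"] = do.get("C", 1 - v["F1"])
    v["F2"] = do.get("F2", v["C"] * (1 - v["F1"]))
    return v

print(camper(0))                      # {'F1': 0, 'C': 1, 'F2': 1}
print(camper(0, {"F2": 0})["C"])      # 1: preventing the June fire leaves C unchanged
print(camper(0, {"F1": 1}))           # {'F1': 1, 'C': 0, 'F2': 0}
```

Only the intervention on the May fire changes whether CC goes camping, which matches the causal order in the story.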


The problem illustrated in this example could also arise in Example 2.4.1. Recall that in 
that example, there was a variable BMC that represented Billy’s medical condition, with pos- 
sible values 0 (Billy feels fine on both Tuesday and Wednesday), 1 (Billy feels sick Tuesday 
morning and fine Wednesday), 2 (Billy feels sick both Tuesday and Wednesday), and 3 (Billy 
feels fine Tuesday and is dead on Wednesday). Now suppose that we change the story so that 
the second dose has no impact, but if Billy feels fine on Tuesday, he will do a dangerous stunt 
dive on Wednesday and die. If we add a variable SD that captures whether Billy does a stunt 
dive, then SD = 1 if BMC is either 0 or 3 and BMC = 3 if SD = 1; again we get circularity.
What this shows is that the problem is due to the combination of having a causal model with a 
variable X whose values correspond to different times together with another variable causally 
“sandwiched” between the different values of X. 

Similarly, the first representation suggested in Example 4.6.1 is inappropriate only because 
it is important for the example when the fire occurs. The model would be insufficiently ex- 
pressive unless it could capture that. However, if the only effect of the forest fire that we are 
considering is the level of tourism in August, then the first representation may be perfectly
adequate. Since a fire in May or June will have the same impact on August tourism, we need
not distinguish between the two possibilities in our model. 

All this shows that causal modeling can be subtle, especially with variables whose values 
are time dependent. The following example emphasizes this point. 


Example 4.6.2 It seems that being born can be viewed as a cause of one’s death. After all, if 
one hadn’t been born, then one wouldn’t have died. But this sounds a little odd. If Jones dies 
suddenly one night, shortly before his 80th birthday, then the coroner’s inquest is unlikely 
to list “birth” as among the causes of his death. Typically, when we investigate the causes 
of death, we are interested in what makes the difference between a person’s dying and his 
surviving. So a model might include a variable D such that D = 1 holds if Jones dies shortly 
before his 80th birthday, and D = 0 holds if he continues to live. If the model also includes a 
variable B, taking the value 1 if Jones is born, 0 otherwise, then there simply is no value that 
D would take if B = 0. Both D = 0 and D = 1 implicitly presuppose that Jones was born 
(i.e., B = 1). Thus, if the model includes a variable such as D, then it should not also include 
B, and so we will not be able to conclude that Jones’s birth is a cause of his death! More 
generally, including the variable D in our model amounts to building in the presupposition 
that Jones exists and lives to be 79. Only variables whose values are compatible with that 
presupposition can be included in the model. ■ 


4.7 Dealing With Normality and Typicality 


As we have seen, adding a normality theory to the model gives the HP definition greater 
flexibility to deal with many cases. This raises the worry, however, that this gives the modeler 
too much flexibility. After all, the modeler can now render any claim that A is an actual cause 
of B false by simply choosing a normality order that makes the actual world $s_{\vec{u}}$ more normal 
than any world s needed to satisfy AC2. Thus, the introduction of normality exacerbates the 
problem of motivating and defending a particular choice of model. 

There was a discussion in Chapter 3 about various interpretations of normality: statistical 
norms, prescriptive norms, moral norms, societal norms, norms that are dictated by policies 
(as in the Knobe-Fraser experiment), and norms of proper functioning. The law also suggests 
a variety of principles for determining the norms that are used in the evaluation of actual 
causation. In criminal law, norms are determined by direct legislation. For example, if there 
are legal standards for the strength of seat belts in an automobile, a seat belt that did not meet 
this standard could be judged a cause of a traffic fatality. By contrast, if a seat belt complied 
with the legal standard, but nonetheless broke because of the extreme forces it was subjected to 
during a particular accident, the fatality would be blamed on the circumstances of the accident, 
rather than the seat belt. In such a case, the manufacturers of the seat belt would not be guilty 
of criminal negligence. In contract law, compliance with the terms of a contract has the force 
of a norm. In tort law, actions are often judged against the standard of “the reasonable person”. 
For instance, if a bystander was harmed when a pedestrian who was legally crossing the street 
suddenly jumped out of the way of an oncoming car, the pedestrian would not be held liable 
for damages to the bystander, since he acted as the hypothetical “reasonable person” would 
have done in similar circumstances. There are also a number of circumstances in which 
deliberate malicious acts of third parties are considered to be “abnormal” interventions and 
affect the assessment of causation. 

As with the choice of variables, I do not expect that these considerations will always suffice 
to pick out a uniquely correct theory of normality for a causal model. They do, however, 
provide resources for a rational critique of models. 


4.8 Proofs 
4.8.1 Proof of Lemma 4.2.2 


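Lemma 4.2.2: Suppose that $M'$ is a conservative extension of $M = ((\mathcal{U}, \mathcal{V}, \mathcal{R}), \mathcal{F})$. Then for
all causal formulas $\varphi$ that mention only variables in $\mathcal{V}$ and all contexts $\vec{u}$, we have
$(M, \vec{u}) \models \varphi$ iff $(M', \vec{u}) \models \varphi$.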

Proof: Fix a context $\vec{u}$. Since $M$ is a recursive model, there is some partial order $\preceq_{\vec{u}}$ on
the endogenous variables such that unless $X \preceq_{\vec{u}} Y$, $Y$ is not affected by $X$ in context $\vec{u}$;
that is, unless $X \preceq_{\vec{u}} Y$, if the exogenous variables are set to $\vec{u}$, then changing the value of
$X$ has no impact on the value of $Y$ according to the structural equations in $M$, no matter
what the setting of the other endogenous variables. Say that $X$ is independent of a set $\vec{W}$ of
endogenous variables in $(M, \vec{u})$ if $X$ is independent of $Y$ in $(M, \vec{u})$ for all $Y \in \vec{W}$.

Suppose that $\mathcal{V} = \{X_1, \ldots, X_n\}$. Since $M$ is a recursive model, we can assume without
loss of generality that these variables are ordered so that $X_i$ is independent of $\{X_{i+1}, \ldots, X_n\}$
in $(M, \vec{u})$ for $1 \le i \le n-1$. (Every set $\mathcal{V}'$ of endogenous variables in a recursive model must
contain a “minimal” element $X$ such that it is not the case that $X' \preceq_{\vec{u}} X$ for $X' \in \mathcal{V}' - \{X\}$.
If such an element did not exist, we could create a cycle. Thus, $X$ is independent of $\mathcal{V}' - \{X\}$
in $(M, \vec{u})$. We construct the ordering by induction, finding $X_{i+1}$ by applying this observation
to $\mathcal{V} - \{X_1, \ldots, X_i\}$.) I now prove by induction on $j$ that, for all $\vec{W} \subseteq \mathcal{V}$, all settings $\vec{w}$
of the variables in $\vec{W}$, and all $x_j \in \mathcal{R}(X_j)$, we have $(M, \vec{u}) \models [\vec{W} \gets \vec{w}](X_j = x_j)$ iff
$(M', \vec{u}) \models [\vec{W} \gets \vec{w}](X_j = x_j)$: all interventions that make sense in both models produce
the same results.
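
The inductive construction of the ordering can be sketched in code. This is only an illustration, not part of the proof: the function name `independent_ordering` and the dictionary `affects` (mapping each variable to the variables it may directly affect in the fixed context) are hypothetical, and the usage example borrows the dependency structure of the voting model of Example 4.8.2.

```python
def independent_ordering(variables, affects):
    """Order variables so that no variable is affected by a later one.

    `affects[x]` is the (assumed) set of variables that x may directly
    affect in the fixed context. Raises ValueError if no minimal element
    exists, which means the dependencies contain a cycle.
    """
    remaining = set(variables)
    order = []
    while remaining:
        # A "minimal" element: no other remaining variable affects it.
        minimal = [
            x for x in sorted(remaining)
            if not any(x in affects.get(y, set()) for y in remaining if y != x)
        ]
        if not minimal:
            raise ValueError("cycle found: the model is not recursive")
        order.append(minimal[0])
        remaining.remove(minimal[0])
    return order


# Dependencies of the voting model: B = A, C = A, D = B and C, WIN = A or B or D.
voting_affects = {
    "A": {"B", "C", "WIN"},
    "B": {"D", "WIN"},
    "C": {"D"},
    "D": {"WIN"},
}
order = independent_ordering(["A", "B", "C", "D", "WIN"], voting_affects)
print(order)
```

Ties between minimal elements are broken alphabetically here; any choice would do, since the proof needs only that each variable is independent of those that come after it.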

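For the base case of the induction, given $\vec{W}$, let $\vec{W}' = \mathcal{V} - (\vec{W} \cup \{X_1\})$. Choose $\vec{w}'$ such
that $(M', \vec{u}) \models [\vec{W} \gets \vec{w}](\vec{W}' = \vec{w}')$. Then we have


    $(M, \vec{u}) \models [\vec{W} \gets \vec{w}](X_1 = x_1)$
iff $(M, \vec{u}) \models [\vec{W} \gets \vec{w}, \vec{W}' \gets \vec{w}'](X_1 = x_1)$ [since $X_1$ is independent of $\vec{W}'$ in $(M, \vec{u})$]
iff $(M', \vec{u}) \models [\vec{W} \gets \vec{w}, \vec{W}' \gets \vec{w}'](X_1 = x_1)$ [since $M'$ is a conservative extension of $M$]
iff $(M', \vec{u}) \models [\vec{W} \gets \vec{w}](X_1 = x_1)$ [by Lemma 2.10.2].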


This completes the proof of the base case. Suppose that $1 < j \le n$ and the result
holds for $1, \ldots, j-1$; I prove it for $j$. Given $\vec{W}$, now let $\vec{W}' = \mathcal{V} - (\vec{W} \cup \{X_j\})$, let
$\vec{W}_1' = \vec{W}' \cap \{X_1, \ldots, X_{j-1}\}$, and let $\vec{W}_2' = \vec{W}' - \vec{W}_1'$. Thus, $\vec{W}_1'$ consists of those variables
that might affect $X_j$ that are not already being intervened on (because they are in $\vec{W}$),
whereas $\vec{W}_2'$ consists of those variables that do not affect $X_j$ and are not already being
intervened on. Since $\vec{W}_2'$ is contained in $\{X_{j+1}, \ldots, X_n\}$, $X_j$ is independent of $\vec{W}_2'$ in $(M, \vec{u})$.

Choose $\vec{w}_1'$ such that $(M, \vec{u}) \models [\vec{W} \gets \vec{w}](\vec{W}_1' = \vec{w}_1')$. Since $\vec{W}_1' \subseteq \{X_1, \ldots, X_{j-1}\}$, by
the induction hypothesis, $(M', \vec{u}) \models [\vec{W} \gets \vec{w}](\vec{W}_1' = \vec{w}_1')$. By Lemma 2.10.2, we have
$(M, \vec{u}) \models [\vec{W} \gets \vec{w}](X_j = x_j)$ iff $(M, \vec{u}) \models [\vec{W} \gets \vec{w}, \vec{W}_1' \gets \vec{w}_1'](X_j = x_j)$, and similarly
for $M'$. Choose $\vec{w}_2'$ such that $(M', \vec{u}) \models [\vec{W} \gets \vec{w}, \vec{W}_1' \gets \vec{w}_1'](\vec{W}_2' = \vec{w}_2')$. Thus,


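    $(M, \vec{u}) \models [\vec{W} \gets \vec{w}](X_j = x_j)$
iff $(M, \vec{u}) \models [\vec{W} \gets \vec{w}, \vec{W}_1' \gets \vec{w}_1'](X_j = x_j)$ [as observed above]
iff $(M, \vec{u}) \models [\vec{W} \gets \vec{w}, \vec{W}_1' \gets \vec{w}_1', \vec{W}_2' \gets \vec{w}_2'](X_j = x_j)$
        [since $X_j$ is independent of $\vec{W}_2'$ in $(M, \vec{u})$]
iff $(M', \vec{u}) \models [\vec{W} \gets \vec{w}, \vec{W}_1' \gets \vec{w}_1', \vec{W}_2' \gets \vec{w}_2'](X_j = x_j)$
        [since $M'$ is a conservative extension of $M$]
iff $(M', \vec{u}) \models [\vec{W} \gets \vec{w}, \vec{W}_1' \gets \vec{w}_1'](X_j = x_j)$ [by Lemma 2.10.2]
iff $(M', \vec{u}) \models [\vec{W} \gets \vec{w}](X_j = x_j)$ [as observed above].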


This completes the proof of the inductive step. 

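It follows easily from the definition of $\models$ that $(M, \vec{u}) \models [\vec{W} \gets \vec{w}](\psi_1 \land \psi_2)$ iff
$(M, \vec{u}) \models [\vec{W} \gets \vec{w}]\psi_1 \land [\vec{W} \gets \vec{w}]\psi_2$ and $(M, \vec{u}) \models [\vec{W} \gets \vec{w}]\neg\psi_1$ iff $(M, \vec{u}) \models \neg[\vec{W} \gets \vec{w}]\psi_1$,
and similarly for $M'$. An easy induction shows that $(M, \vec{u}) \models [\vec{W} \gets \vec{w}]\psi$ iff
$(M', \vec{u}) \models [\vec{W} \gets \vec{w}]\psi$ for an arbitrary Boolean combination $\psi$ of primitive events that mentions only
variables in $\mathcal{V}$. Since causal formulas are Boolean combinations of formulas of the form
$[\vec{W} \gets \vec{w}]\psi$, another easy induction shows that $(M, \vec{u}) \models \psi'$ iff $(M', \vec{u}) \models \psi'$ for all causal
formulas $\psi'$. ■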


4.8.2 Proof of Theorem 4.3.1 


Theorem 4.3.1: If $X = x$ is not a cause of $Y = y$ in $(M, \vec{u})$ according to the updated HP
definition but is a cause according to the original HP definition, then there is a model $M'$
that is a conservative extension of $M$ such that $X = x$ is not a cause of $Y = y$ in $(M', \vec{u})$
according to either the original or updated HP definitions.


Proof: Suppose that $(\vec{W}, \vec{w}, x')$ is a witness to $X = x$ being a cause of $Y = y$ in $(M, \vec{u})$
according to the original HP definition. Let $(M, \vec{u}) \models \vec{W} = \vec{w}^*$. We must have $\vec{w} \ne \vec{w}^*$, for
otherwise it is easy to see that $X = x$ would be a cause of $Y = y$ in $(M, \vec{u})$ according to the
updated HP definition, with witness $(\vec{W}, \vec{w}, x')$.

If $M'$ is a conservative extension of $M$ with additional variables $\mathcal{V}'$, say that $(\vec{W}', \vec{w}', x')$
extends $(\vec{W}, \vec{w}, x')$ if $\vec{W} \subseteq \vec{W}' \subseteq \vec{W} \cup \mathcal{V}'$ and $\vec{w}'$ agrees with $\vec{w}$ on the variables in $\vec{W}$. I
now construct a conservative extension $M'$ of $M$ such that $X = x$ is not a cause of $Y = y$
in $(M', \vec{u})$ according to the original HP definition with a witness extending $(\vec{W}, \vec{w}, x')$. Of
course, this just kills the witnesses in $M'$ that extend $(\vec{W}, \vec{w}, x')$. I then show that we can
construct further conservative extensions of $M$ to kill all other extensions of witnesses to
$X = x$ being a cause of $Y = y$ in $(M, \vec{u})$ according to the original HP definition.

Let $M'$ be obtained from $M$ by adding one new variable $\mathit{NW}$. All the variables have the
same equations in $M$ and $M'$ except for $Y$ and (of course) $\mathit{NW}$. The equations for $\mathit{NW}$ are
easy to explain: if $X = x$ and $\vec{W} = \vec{w}$, then $\mathit{NW} = 1$; otherwise, $\mathit{NW} = 0$. The equations
for $Y$ are the same in $M$ and $M'$ (and do not depend on the value of $\mathit{NW}$) except for two
special cases. To define these cases, for each variable $Z \in \mathcal{V} - \vec{W}$, if $x'' \in \{x, x'\}$, define
$z_{x'',\vec{w}}$ as the value such that $(M, \vec{u}) \models [X \gets x'', \vec{W} \gets \vec{w}](Z = z_{x'',\vec{w}})$. That is, $z_{x'',\vec{w}}$ is the
value taken by $Z$ if $X$ is set to $x''$ and $\vec{W}$ is set to $\vec{w}$. Let $\mathcal{V}'$ consist of all variables in $\mathcal{V}$ other
than $Y$, let $\vec{v}'$ be a setting of the variables in $\mathcal{V}'$, and let $\vec{Z}'$ consist of all variables in $\mathcal{V}' - \vec{W}$
other than $X$. Then we want the equations for $Y$ in $M'$ to be such that for all $j \in \{0, 1\}$, we
have


$(M, \vec{u}) \models [\mathcal{V}' \gets \vec{v}'](Y = y'')$ iff $(M', \vec{u}) \models [\mathcal{V}' \gets \vec{v}', \mathit{NW} \gets j](Y = y'')$


unless the assignment $\mathcal{V}' \gets \vec{v}'$ results in either (a) $X = x$, $\vec{W} = \vec{w}$, $Z = z_{x,\vec{w}}$ for all $Z \in \vec{Z}'$,
and $\mathit{NW} = 0$ or (b) $X = x'$, $\vec{W} = \vec{w}$, $Z = z_{x',\vec{w}}$ for all $Z \in \vec{Z}'$, and $\mathit{NW} = 1$. That
is, the structural equations for $Y$ are the same in $M$ and $M'$ except for two special cases,
characterized by (a) and (b). If (a) holds, $Y = y'$ in $M'$; if (b) holds, $Y = y$. Note that in
both of these cases, the value of $\mathit{NW}$ is “abnormal”. If $X = x$, $\vec{W} = \vec{w}$, and $Z = z_{x,\vec{w}}$ for
all $Z \in \vec{Z}'$, then $\mathit{NW}$ should be 1; if we set $X$ to $x'$ and change the values of the variables in
$\vec{Z}'$ accordingly, then $\mathit{NW}$ should be 0.

I now show that $M'$ has the desired properties and, in addition, does not make $X = x$ a
cause in new ways.


Lemma 4.8.1 


(a) It is not the case that $X = x$ is a cause of $Y = y$ in $(M', \vec{u})$ according to the updated
HP definition with a witness that extends $(\vec{W}, \vec{w}, x')$.


(b) $M'$ is a conservative extension of $M$.


(c) If $X = x$ is a cause of $Y = y$ in $(M', \vec{u})$ according to the updated (resp., original)
HP definition with a witness extending $(\vec{W}', \vec{w}', x'')$, then $X = x$ is a cause of
$Y = y$ in $(M, \vec{u})$ according to the updated (resp., original) HP definition with witness
$(\vec{W}', \vec{w}', x'')$.


Proof: For part (a), suppose, by way of contradiction, that $X = x$ is a cause of $Y = y$
in $(M', \vec{u})$ according to the updated HP definition with a witness $(\vec{W}', \vec{w}', x')$ that extends
$(\vec{W}, \vec{w}, x')$. If $\mathit{NW} \notin \vec{W}'$, then $\vec{W}' = \vec{W}$. But then, since $(M', \vec{u}) \models \mathit{NW} = 0$
and $(M', \vec{u}) \models [X \gets x, \vec{W} \gets \vec{w}, \mathit{NW} \gets 0](Y = y')$, it follows that $(M', \vec{u}) \models
[X \gets x, \vec{W} \gets \vec{w}](Y = y')$, so AC2(b$^o$) fails, contradicting the assumption that $X = x$ is
a cause of $Y = y$ in $(M', \vec{u})$ according to the original HP definition. Now suppose that
$\mathit{NW} \in \vec{W}'$. There are two cases, depending on how the value of $\mathit{NW}$ is set in $\vec{w}'$. If
$\mathit{NW} = 0$, since $(M', \vec{u}) \models [X \gets x, \vec{W} \gets \vec{w}, \mathit{NW} \gets 0](Y = y')$, AC2(b$^o$) fails; and if
$\mathit{NW} = 1$, since $(M', \vec{u}) \models [X \gets x', \vec{W} \gets \vec{w}, \mathit{NW} \gets 1](Y = y)$, AC2(a) fails. So, in all
cases, we get a contradiction to the assumption that $X = x$ is a cause of $Y = y$ in $(M', \vec{u})$
according to the original HP definition with a witness $(\vec{W}', \vec{w}', x')$ that extends $(\vec{W}, \vec{w}, x')$.

For part (b), note that the only variable in $\mathcal{V}$ for which the equations in $M$ and $M'$ are
different is $Y$. Consider any setting of the variables in $\mathcal{V}$ other than $Y$. Except for the two
special cases noted above, the value of $Y$ is clearly the same in $M$ and $M'$. But for these
two special cases, as was noted above, the value of $\mathit{NW}$ is “abnormal”, that is, it is not the
same as its value according to the equations given the setting of the other variables. It follows
that for all settings $\vec{v}$ of the variables $\mathcal{V}'$ in $\mathcal{V}$ other than $Y$ and all values $y''$ of $Y$, we have
$(M, \vec{u}) \models [\mathcal{V}' \gets \vec{v}](Y = y'')$ iff $(M', \vec{u}) \models [\mathcal{V}' \gets \vec{v}](Y = y'')$. Thus, $M'$ is a conservative
extension of $M$.

For part (c), suppose that $X = x$ is a cause of $Y = y$ in $(M', \vec{u})$ according to the updated
(resp., original) HP definition with witness $(\vec{W}'', \vec{w}'', x'')$. Let $\vec{W}'$ and $\vec{w}'$ be the restrictions
of $\vec{W}''$ and $\vec{w}''$, respectively, to the variables in $\mathcal{V}$. If $\mathit{NW} \notin \vec{W}''$ (so that $\vec{W}'' = \vec{W}'$), then,
since $M'$ is a conservative extension of $M$, it easily follows that $(\vec{W}', \vec{w}', x'')$ is a witness
to $X = x$ being a cause of $Y = y$ in $(M, \vec{u})$ according to the updated (resp., original) HP
definition. If $\mathit{NW} \in \vec{W}''$, then it suffices to show that $(\vec{W}', \vec{w}', x'')$ is also a witness to $X = x$
being a cause of $Y = y$ in $(M', \vec{u})$; that is, $\mathit{NW}$ does not play an essential role in the witness.
I now do this.

If $\mathit{NW} = 0$ is a conjunct of $\vec{W}'' = \vec{w}''$, since the equations for $Y$ are the same in $M$ and
$M'$ except for two cases, then the only way that $\mathit{NW} = 0$ can play an essential role in the
witness is if setting $\vec{W}' = \vec{w}'$ and $X = x$ results in $\vec{W} = \vec{w}$ and $Z = z_{x,\vec{w}}$ for all $Z \in \vec{Z}'$
(i.e., we are in the first of the two cases where the value of $Y$ does not agree in $(M, \vec{u})$ and
$(M', \vec{u})$). But then $Y = y'$, so if this were the case, AC2(b$^o$) (and hence AC2(b$^u$)) would not
hold. Similarly, if $\mathit{NW} = 1$ is a conjunct of $\vec{W}'' = \vec{w}''$, $\mathit{NW}$ plays a role only if $x'' = x'$
and setting $\vec{W}' = \vec{w}'$ and $X = x'$ results in $\vec{W} = \vec{w}$ and $Z = z_{x',\vec{w}}$ for all $Z \in \vec{Z}'$
(i.e., we are in the second of the two cases where the value of $Y$ does not agree in $(M, \vec{u})$
and $(M', \vec{u})$). But then $Y = y$, so if this were the case, AC2(a) would not hold, and again
we would have a contradiction to $X = x$ being a cause of $Y = y$ in $(M', \vec{u})$ with witness
$(\vec{W}'', \vec{w}'', x'')$. Thus, $(\vec{W}', \vec{w}', x'')$ must be a witness to $X = x$ being a cause of $Y = y$ in
$(M', \vec{u})$, and hence also in $(M, \vec{u})$. This completes the proof of part (c). ■


Lemma 4.8.1 is not quite enough to complete the proof of Theorem 4.3.1. There may be
several witnesses to $X = x$ being a cause of $Y = y$ in $(M, \vec{u})$ according to the original HP
definition. Although we have removed one of the witnesses, some others may remain, so that
$X = x$ may still be a cause of $Y = y$ in $(M', \vec{u})$. But by Lemma 4.8.1(c), if there is a
witness to $X = x$ being a cause of $Y = y$ in $(M', \vec{u})$, it must extend a witness to $X = x$
being a cause of $Y = y$ in $(M, \vec{u})$. We can repeat the construction of Lemma 4.8.1 to kill
this witness as well. Since there are only finitely many witnesses to $X = x$ being a cause of
$Y = y$ in $(M, \vec{u})$, after finitely many extensions, we can kill them all. After this is done, we
have a causal model $M^*$ extending $M$ such that $X = x$ is not a cause of $Y = y$ in $(M^*, \vec{u})$
according to the original HP definition. ■


It is interesting to apply the construction of Theorem 4.3.1 to Example 2.8.1 (the loaded-gun
example). The variable $\mathit{NW}$ added by the construction is almost identical to the extra
variable $B'$ (where $B' = A \land B$, that is, $B$ shoots a loaded gun) added in Example 2.8.1 to show that
this example could be dealt with using the original HP definition. Indeed, the only difference
is that $\mathit{NW} = 0$ if $A = B = C = 1$, while $B' = 1$ in this case. But since $D = 1$ if
$A = B = C = 1$ and $\mathit{NW} = 0$, the equations for $D$ are the same in both causal models if
$A = B = C = 1$. Although it seems strange, given our understanding of the meaning of the
variables, to have $\mathit{NW} = 0$ if $A = B = C = 1$, it is easy to see that this definition works
equally well in showing that $A = 1$ is not a cause of $D = 1$ in the context where $A = 1$,
$B = 0$, and $C = 1$ according to the original HP definition.


4.8.3 Proofs and example for Section 4.4 


I start with the proof of Theorem 4.4.1. 


Theorem 4.4.1: For all $n \ge 0$, $M_{n+1}$ is a conservative extension of $M_n$. Moreover, $B = 1$
is not part of a cause of $\mathit{VS} = 1$ in $(M_{2n}, u_1)$, and $B = 1$ is part of a cause of $\mathit{VS} = 1$ in
$(M_{2n+1}, u_1)$, for $n = 0, 1, 2, \ldots$ (according to all variants of the HP definition).


Proof: Fix $n \ge 0$. To see that $M_{2n+1}$ is a conservative extension of $M_{2n}$, note that, aside
from $\mathit{VS}$, the equations for all endogenous variables that appear in both $M_{2n}$ and $M_{2n+1}$
(namely, $B$, $\mathit{VS}$, $A_1, \ldots, A_n$, $\mathit{PN}_1, \ldots, \mathit{PN}_n$) are the same in both models. Thus, it clearly
suffices to show that, for every setting of the variables $U$, $B$, $A_1, \ldots, A_n$, $\mathit{PN}_1, \ldots, \mathit{PN}_n$,
the value of $\mathit{VS}$ is the same in both $M_{2n}$ and $M_{2n+1}$. Clearly, in the context $u_0$, $\mathit{VS} = 0$ in
both models. In the context $u_1$, note that if both $A_j$ and $\mathit{PN}_j$ are 0 for some $j \in \{1, \ldots, n\}$,
then $\mathit{VS} = 0$ in both $M_{2n}$ and $M_{2n+1}$. (Again, recall that $A_j = 0$ means that assassin #$j$
puts poison in the coffee.) If one of $A_j$ and $\mathit{PN}_j$ is 1 for all $j \in \{1, \ldots, n\}$, then $\mathit{VS} = 1$ in
both $M_{2n}$ and $M_{2n+1}$ (in the latter case because $A_{n+1} = 1$ in $(M_{2n+1}, u_1)$).

The argument to show that $M_{2n+2}$ is a conservative extension of $M_{2n+1}$ is similar in
spirit. Now the equations for all variables but $\mathit{VS}$ and $\mathit{PN}_{n+1}$ are the same in $M_{2n+1}$
and $M_{2n+2}$. It clearly suffices to show that, for every setting of the variables $U$, $B$,
$A_1, \ldots, A_n, A_{n+1}$, $\mathit{PN}_1, \ldots, \mathit{PN}_n$, the value of $\mathit{VS}$ is the same in both $M_{2n+1}$ and $M_{2n+2}$.
Clearly in context $u_0$, $\mathit{VS} = 0$ in both models. In $u_1$, note that if both $A_j$ and $\mathit{PN}_j$ are 0
for some $j \in \{1, \ldots, n\}$, then $\mathit{VS} = 0$ in both $M_{2n+1}$ and $M_{2n+2}$. So suppose that one
of $A_j$ and $\mathit{PN}_j$ is 1 for all $j \in \{1, \ldots, n\}$. If $A_{n+1} = 1$, then $\mathit{VS} = 1$ in both $M_{2n+1}$
and $M_{2n+2}$; and if $A_{n+1} = 0$, then $\mathit{VS} = B$ in both $M_{2n+1}$ and $M_{2n+2}$ (in the latter case,
because $\mathit{VS} = \mathit{PN}_{n+1}$, and $\mathit{PN}_{n+1} = B$ if $A_{n+1} = 0$).

To see that $B = 1$ is part of a cause of $\mathit{VS} = 1$ in $(M_{2n+1}, u_1)$ according to all variants of
the HP definition, observe that


$(M_{2n+1}, u_1) \models [B \gets 0, A_{n+1} \gets 0](\mathit{VS} = 0)$.


Clearly just setting $B$ to 0 or $A_{n+1}$ to 0 is not enough to make $\mathit{VS} = 0$, no matter which other
variables are held at their actual values. Thus, $B = 1 \land A_{n+1} = 1$ is a cause of $\mathit{VS} = 1$
in $(M_{2n+1}, u_1)$ according to the modified HP definition. It follows from Theorem 2.2.3 that
$B = 1$ is a cause of $\mathit{VS} = 1$ in $(M_{2n+1}, u_1)$ according to the original and updated definition
as well.

Finally, to see that, according to all the variants of the HP definition, $B = 1$ is not part
of a cause of $\mathit{VS} = 1$ in $(M_{2n+2}, u_1)$, first suppose, by way of contradiction, that it is a
cause according to the original HP definition, with witness $(\vec{W}, \vec{w}, 0)$. By AC2(a), we must
have $(M_{2n+2}, u_1) \models [B \gets 0, \vec{W} \gets \vec{w}](\mathit{VS} = 0)$. Thus, it must be the case that either
(1) there is some $j \le n+1$ such that $A_j, \mathit{PN}_j \in \vec{W}$ and in $\vec{w}$, both $A_j$ and $\mathit{PN}_j$ are
set to 0; or (2) for some $j \le n+1$, $A_j \in \vec{W}$, $\mathit{PN}_j \in \vec{Z}$, and $A_j$ is set to 0 in $\vec{w}$. In
case (1), we must have $(M_{2n+2}, u_1) \models [B \gets 1, \vec{W} \gets \vec{w}](\mathit{VS} = 0)$, so AC2(b$^o$) does
not hold. In case (2), since $\mathit{PN}_j \in \vec{Z}$, $\mathit{PN}_j = 0$ in the actual world, and $(M_{2n+2}, u_1) \models
[B \gets 1, \vec{W} \gets \vec{w}, \mathit{PN}_j \gets 0](\mathit{VS} = 0)$ (since $A_j$ is set to 0 when $\vec{W}$ is set to $\vec{w}$), AC2(b$^o$)
again does not hold. Thus, $B = 1$ is not a cause of $\mathit{VS} = 1$ in $(M_{2n+2}, u_1)$ according to the
original HP definition. (Not surprisingly, this is just a generalization of the argument used in
Section 3.4.1.) By Theorem 2.2.3, it follows that $B = 1$ is not part of a cause of $\mathit{VS} = 1$ in
$(M_{2n+2}, u_1)$ according to the updated or modified HP definition either.


I next prove Theorem 4.4.4. 
Theorem 4.4.4: Suppose that $M$ and $M'$ are extended causal models such that $M'$ is a
conservative extension of $M$, the normality ordering in $M'$ respects the equations for all
variables in $\mathcal{V}' - \mathcal{V}$ relative to $\vec{u}$, where $\mathcal{V}$ and $\mathcal{V}'$ are the sets of endogenous variables in $M$
and $M'$, respectively, and all the variables in $\varphi$ are in $\mathcal{V}$. Then the following holds:


(a) According to the original and updated HP definition, if $\vec{X} = \vec{x}$ is not a cause of $\varphi$ in
$(M, \vec{u})$, then either $\vec{X} = \vec{x}$ is not a cause of $\varphi$ in $(M', \vec{u})$ or there is a strict subset $\vec{X}_1$
of $\vec{X}$ such that $\vec{X}_1 = \vec{x}_1$ is a cause of $\varphi$ in $(M, \vec{u})$, where $\vec{x}_1$ is the restriction of $\vec{x}$ to
the variables in $\vec{X}_1$.


(b) According to the modified HP definition, $\vec{X} = \vec{x}$ is a cause of $\varphi$ in $(M, \vec{u})$ iff $\vec{X} = \vec{x}$
is a cause of $\varphi$ in $(M', \vec{u})$ (so, in particular, a cause of $\varphi$ in $(M', \vec{u})$ cannot involve a
variable in $\mathcal{V}' - \mathcal{V}$).


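Proof: I first prove part (a) for the updated definition. Suppose that $\vec{X} = \vec{x}$ is not a cause
of $\varphi$ in $(M, \vec{u})$ according to the updated HP definition, but is a cause of $\varphi$ in $(M', \vec{u})$, with
witness $(\vec{W}, \vec{w}, \vec{x}')$. Let $\vec{W}_1$ be the intersection of $\vec{W}$ with $\mathcal{V}$, the set of endogenous variables
in $M$, let $\vec{Z}_1 = \mathcal{V} - \vec{W}_1$, and let $\vec{w}_1$ be the restriction of $\vec{w}$ to the variables in $\vec{W}_1$. Since $\vec{X} = \vec{x}$
is not a cause of $\varphi$ in $(M, \vec{u})$, it is certainly not a cause with witness $(\vec{W}_1, \vec{w}_1, \vec{x}')$, so either
(i) $(M, \vec{u}) \models \vec{X} \ne \vec{x} \lor \neg\varphi$ (i.e., AC1 is violated); (ii) $(M, \vec{u}) \models [\vec{X} \gets \vec{x}', \vec{W}_1 \gets \vec{w}_1]\varphi$ or
$s_{\vec{X} = \vec{x}', \vec{W}_1 = \vec{w}_1, \vec{u}} \not\succeq s_{\vec{u}}$ (i.e., AC2$^+$(a) is violated); (iii) there exist subsets $\vec{W}_1'$ of $\vec{W}_1$ and $\vec{Z}_1'$
of $\vec{Z}_1$ such that if $(M, \vec{u}) \models \vec{Z}_1' = \vec{z}_1$ (i.e., $\vec{z}_1$ gives the actual values of the variables in $\vec{Z}_1'$),
then $(M, \vec{u}) \models [\vec{X} \gets \vec{x}, \vec{W}_1' \gets \vec{w}_1, \vec{Z}_1' \gets \vec{z}_1]\neg\varphi$ (i.e., AC2(b$^u$) is violated); or (iv) there is a
strict subset $\vec{X}_1$ of $\vec{X}$ such that $\vec{X}_1 = \vec{x}_1$ is a cause of $\varphi$ in $(M, \vec{u})$, where $\vec{x}_1$ is the restriction
of $\vec{x}$ to the variables in $\vec{X}_1$ (i.e., AC3 is violated). Since $M'$ is a conservative extension of $M$,
by Lemma 4.2.2, if (i) or (iii) holds, then the same statement holds with $M$ replaced by $M'$,
showing that $\vec{X} = \vec{x}$ is not a cause of $\varphi$ in $(M', \vec{u})$ with witness $(\vec{W}, \vec{w}, \vec{x}')$.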

Now suppose that (ii) holds. For each variable $V \in \vec{W} - \vec{W}_1$, let $v$ be the value of $V$ in $\vec{w}$.
Further assume that for all $V \in \vec{W} - \vec{W}_1$, we have $(M', \vec{u}) \models [\vec{X} \gets \vec{x}', \vec{W}_1 \gets \vec{w}_1](V = v)$. I
show that this additional assumption leads to a contradiction. With this additional assumption,
if $(M, \vec{u}) \models [\vec{X} \gets \vec{x}', \vec{W}_1 \gets \vec{w}_1]\varphi$, then $(M', \vec{u}) \models [\vec{X} \gets \vec{x}', \vec{W} \gets \vec{w}]\varphi$, so if the reason
that $\vec{X} = \vec{x}$ is not a cause of $\varphi$ in $(M, \vec{u})$ is that AC2(a) is violated, then AC2(a) is also
violated in $(M', \vec{u})$, and $\vec{X} = \vec{x}$ is not a cause of $\varphi$ in $(M', \vec{u})$ with witness $(\vec{W}, \vec{w}, \vec{x}')$,
a contradiction. Moreover, it follows that $s_{\vec{X} = \vec{x}', \vec{W}_1 = \vec{w}_1, \vec{u}} = s_{\vec{X} = \vec{x}', \vec{W} = \vec{w}, \vec{u}}$. By CE in the
definition of conservative extension, if $s_{\vec{X} = \vec{x}', \vec{W}_1 = \vec{w}_1, \vec{u}} \not\succeq s_{\vec{u}}$, then $s_{\vec{X} = \vec{x}', \vec{W}_1 = \vec{w}_1, \vec{u}} \not\succeq' s_{\vec{u}}$, and
hence $s_{\vec{X} = \vec{x}', \vec{W} = \vec{w}, \vec{u}} \not\succeq' s_{\vec{u}}$. So if the reason that $\vec{X} = \vec{x}$ is not a cause of $\varphi$ in $(M, \vec{u})$ is that
the normality condition in AC2$^+$(a) is violated, it is also violated in $(M', \vec{u})$, and again we
get a contradiction.

Thus, we must have $(M', \vec{u}) \models [\vec{X} \gets \vec{x}', \vec{W}_1 \gets \vec{w}_1](V \ne v)$ for some variable
$V \in \vec{W} - \vec{W}_1$. That means that the variable $V$ takes on a value other than that specified
by the equations in $(M', \vec{u})$. Since, by assumption, $M'$ respects the equations for $V$ relative
to $\vec{u}$, we have $s_{\vec{X} = \vec{x}', \vec{W} = \vec{w}, \vec{u}} \not\succeq' s_{\vec{u}}$, contradicting the assumption that $\vec{X} = \vec{x}$ is a cause of
$\varphi$ in $(M', \vec{u})$ with witness $(\vec{W}, \vec{w}, \vec{x}')$. Either way, if (ii) holds, we get a contradiction. I
leave it to the reader to check that the same contradiction can be obtained using the alternative
version of AC2$^+$(a) where normality is defined on contexts (and similarly for the remaining
arguments in the proof of this result).

Thus, either $\vec{X} = \vec{x}$ is not a cause of $\varphi$ in $(M', \vec{u})$ or $\vec{X}_1 = \vec{x}_1$ is a cause of $\varphi$ in $(M, \vec{u})$
for some strict subset $\vec{X}_1$ of $\vec{X}$.

I leave it to the reader to check that the same argument goes through with essentially no
change in the case of the original HP definition. For future reference, note that it also goes
through with minimal change for the modified HP definition; we simply use AC2(a$^m$) rather
than AC2(a) and drop case (iii).

For part (b), first suppose by way of contradiction that, according to the modified HP
definition, $\vec{X} = \vec{x}$ is not a cause of $\varphi$ in $(M', \vec{u})$ but is a cause of $\varphi$ in $(M, \vec{u})$ with witness
$(\vec{W}, \vec{w}, \vec{x}')$. Since $M'$ is a conservative extension of $M$, AC2$^+$(a$^m$) must hold for $\vec{X} = \vec{x}$
with witness $(\vec{W}, \vec{w}, \vec{x}')$. (Here again, I am using condition CE to argue that $s_{\vec{W} = \vec{w}, \vec{X} = \vec{x}', \vec{u}}$ is
more normal than $s_{\vec{u}}$ in $M'$ because this is the case in $M$.) Thus, if $\vec{X} = \vec{x}$ is not a cause of
$\varphi$ in $M'$, AC3 must fail, and there must be a strict subset $\vec{X}_1$ of $\vec{X}$ such that $\vec{X}_1 = \vec{x}_1$ is a
cause of $\varphi$ in $(M', \vec{u})$, where $\vec{x}_1$ is the restriction of $\vec{x}$ to $\vec{X}_1$. Clearly $\vec{X}_1 = \vec{x}_1$ is not a cause
of $\varphi$ in $(M, \vec{u})$, for otherwise $\vec{X} = \vec{x}$ would not be a cause of $\varphi$ in $(M, \vec{u})$. Thus, as observed
in the proof of part (a), since $\vec{X}_1 = \vec{x}_1$ is a cause of $\varphi$ in $(M', \vec{u})$ according to the modified
HP definition, there must be a strict subset $\vec{X}_2$ of $\vec{X}_1$ such that $\vec{X}_2 = \vec{x}_2$ is a cause of $\varphi$ in
$(M, \vec{u})$ according to the modified HP definition, where $\vec{x}_2$ is the restriction of $\vec{x}$ to $\vec{X}_2$. But
this contradicts the assumption that $\vec{X} = \vec{x}$ is a cause of $\varphi$ in $(M, \vec{u})$, since then AC3 does not
hold.
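For the converse, suppose that $\vec{X} = \vec{x}$ is a cause of $\varphi$ in $(M', \vec{u})$ with witness $(\vec{W}, \vec{w}, \vec{x}')$.
Thus $(M', \vec{u}) \models [\vec{X} \gets \vec{x}', \vec{W} \gets \vec{w}]\neg\varphi$. Let $\vec{X}_1 = \vec{X} \cap \mathcal{V}$, let $\vec{W}_1 = \vec{W} \cap \mathcal{V}$, let $\vec{X}_2 = \vec{X} - \vec{X}_1$,
let $\vec{W}_2 = \vec{W} - \vec{W}_1$, and let $\vec{x}_i'$ and $\vec{w}_i$ be the restrictions of $\vec{x}'$ and $\vec{w}$ to the variables in $\vec{X}_i$
and $\vec{W}_i$, respectively, for $i = 1, 2$. Since the normality ordering in $M'$ respects the equations
for all the variables in $\mathcal{V}' - \mathcal{V}$ relative to $\vec{u}$, it is not hard to show that $(M', \vec{u}) \models
[\vec{X}_1 \gets \vec{x}_1', \vec{W}_1 \gets \vec{w}_1](\vec{X}_2 = \vec{x}_2' \land \vec{W}_2 = \vec{w}_2)$. For otherwise, $s_{\vec{X} = \vec{x}', \vec{W} = \vec{w}, \vec{u}} \not\succeq' s_{\vec{u}}$, and
AC2$^+$(a$^m$) would not hold, contradicting the assumption that $\vec{X} = \vec{x}$ is a cause of $\varphi$ in
$(M', \vec{u})$ with witness $(\vec{W}, \vec{w}, \vec{x}')$. Since $(M', \vec{u}) \models [\vec{X} \gets \vec{x}', \vec{W} \gets \vec{w}]\neg\varphi$, it follows by
Lemma 2.10.2 that $(M', \vec{u}) \models [\vec{X}_1 \gets \vec{x}_1', \vec{W}_1 \gets \vec{w}_1]\neg\varphi$. Since we are dealing with the
modified HP definition, $(M', \vec{u}) \models (\vec{W} = \vec{w}) \land \varphi$, so $\vec{X}_1$ must be nonempty. We must have
$\vec{X}_1 = \vec{X}$, otherwise $\vec{X} = \vec{x}$ cannot be a cause of $\varphi$ in $(M', \vec{u})$; we would have a violation
of AC3. I claim that $\vec{X} = \vec{x}$ is a cause of $\varphi$ in $(M, \vec{u})$, as desired. For if not, we can apply
the same argument as in part (a) (which, as I observed above, also holds for the modified HP
definition) to show that there must be a strict subset $\vec{X}_1$ of $\vec{X}$ such that $\vec{X}_1 = \vec{x}_1$ is a cause of
$\varphi$ in $(M', \vec{u})$, where $\vec{x}_1$ is the restriction of $\vec{x}$ to $\vec{X}_1$. But this contradicts the assumption that
$\vec{X} = \vec{x}$ is a cause of $\varphi$ in $(M', \vec{u})$. ■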


Theorem 4.4.5: According to the original and updated HP definition, if (a) $M'$ is a conservative
extension of $M$, (b) $M''$ is a conservative extension of $M'$, (c) $\vec{X} = \vec{x}$ is a cause of $\varphi$ in
$(M, \vec{u})$ and $(M'', \vec{u})$, (d) the normality ordering in $M'$ respects the equations for all endogenous
variables in $M'$ not in $M$ relative to $\vec{u}$, and (e) the normality ordering in $M''$ respects
the equations for all endogenous variables in $M''$ not in $M'$ relative to $\vec{u}$, then $\vec{X} = \vec{x}$ is also
a cause of $\varphi$ in $(M', \vec{u})$.


Proof: The same argument works for both the original and updated HP definition. Suppose
by way of contradiction that $\vec{X} = \vec{x}$ is not a cause of $\varphi$ in $(M', \vec{u})$. By Theorem 4.4.4(a),
there must be a strict subset $\vec{X}_1$ of $\vec{X}$ such that $\vec{X}_1 = \vec{x}_1$ is a cause of $\varphi$ in $(M', \vec{u})$, where $\vec{x}_1$
is the restriction of $\vec{x}$ to the variables in $\vec{X}_1$. But $\vec{X}_1 = \vec{x}_1$ cannot be a cause of $\varphi$ in $(M, \vec{u})$,
for then, by AC3, $\vec{X} = \vec{x}$ would not be a cause of $\varphi$ in $(M, \vec{u})$. By Theorem 4.4.4(a) again,
there must be a strict subset $\vec{X}_2$ of $\vec{X}_1$ such that $\vec{X}_2 = \vec{x}_2$ is a cause of $\varphi$ in $(M, \vec{u})$, where
$\vec{x}_2$ is the restriction of $\vec{x}$ to $\vec{X}_2$. But then, by AC3, $\vec{X} = \vec{x}$ cannot be a cause of $\varphi$ in $(M, \vec{u})$,
giving us the desired contradiction. ■


I next give an example showing that it is possible for $\vec{X} = \vec{x}$ to go from being a non-cause
of $\varphi$ to being a cause to being a non-cause again according to the updated HP definition, even
taking normality into account, showing that Theorem 4.4.5 is the best we can do. Because
this type of behavior can happen only if causes are not single conjuncts, it is perhaps not so
surprising that the example is a variant of Example 2.8.2, which shows that according to the
updated HP definition, a cause may involve more than one conjunct.


Example 4.8.2 I start with a simplified version of the model in Example 2.8.2, where there
is no variable $D'$. Again, $A$ votes for a candidate. $A$'s vote is recorded in two optical scanners
$B$ and $C$. $D$ collects the output of the scanners. The candidate wins (i.e., $\mathit{WIN} = 1$) if any
of $A$, $B$, or $D$ is 1. The value of $A$ is determined by the exogenous variable. The following
structural equations characterize the remaining variables:


• $B = A$,
• $C = A$,
• $D = B \land C$,
• $\mathit{WIN} = A \lor B \lor D$.


Call the resulting causal model $M$. In the actual context $u$, $A = 1$, so $B = C = D =
\mathit{WIN} = 1$. Assume that all worlds in $M$ are equally normal.

I claim that $B = 1$ is a cause of $\mathit{WIN} = 1$ in $(M, u)$ according to the updated HP
definition. To see this, take $\vec{W} = \{A\}$. Consider the contingency where $A = 0$. Clearly if
$B = 0$, then $\mathit{WIN} = 0$, and if $B = 1$, then $\mathit{WIN} = 1$. It is easy to check that AC2 holds.
Moreover, since $B = 1$ is a cause of $\mathit{WIN} = 1$ in $(M, u)$, by AC3, $B = 1 \land C = 1$ cannot
be a cause of $\mathit{WIN} = 1$ in $(M, u)$.
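
The two interventions used in this argument can be evaluated mechanically. The following is a minimal sketch, assuming 0/1 values for all variables; the helper `solve_M` and its `do` argument (a dictionary of interventions) are names introduced here for illustration. It evaluates only the contingency $A = 0$ with $B$ set to 0 and to 1, and is not a full check of AC2.

```python
def solve_M(a, do=None):
    """Solve the equations of M, with the exogenous variable setting A = a.

    `do` maps variable names to values they are forced to take; an
    intervened variable ignores its structural equation.
    """
    do = do or {}
    v = {}
    v["A"] = do.get("A", a)
    v["B"] = do.get("B", v["A"])                        # B = A
    v["C"] = do.get("C", v["A"])                        # C = A
    v["D"] = do.get("D", v["B"] & v["C"])               # D = B and C
    v["WIN"] = do.get("WIN", v["A"] | v["B"] | v["D"])  # WIN = A or B or D
    return v


actual = solve_M(1)                                # the actual context u
win_if_b0 = solve_M(1, {"A": 0, "B": 0})["WIN"]    # [A <- 0, B <- 0]
win_if_b1 = solve_M(1, {"A": 0, "B": 1})["WIN"]    # [A <- 0, B <- 1]
print(actual, win_if_b0, win_if_b1)
```

Under the contingency $A = 0$, the outcome tracks $B$: it is 0 when $B$ is set to 0 and 1 when $B$ is set to 1.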

Now consider the model $M'$ from Example 2.8.2. Recall that it is just like $M$, except that
there is one more endogenous variable $D'$, where $D' = B \land \neg A$. The equation for $\mathit{WIN}$ now
becomes $\mathit{WIN} = A \lor D' \lor D$. All the other equations in $M'$ are the same as those in $M$.
Define the normality ordering in $M'$ so that it respects the equations for $D'$ relative to $u$: all
worlds where $D' = B \land \neg A$ are equally normal; all worlds where $D' \ne B \land \neg A$ are also
equally normal, but less normal than worlds where $D' = B \land \neg A$.

It is easy to see that $M'$ is a conservative extension of $M$. Since $D'$ does not affect any
variable but $\mathit{WIN}$, and all the equations except that for $\mathit{WIN}$ are unchanged, it suffices to
show that for all settings of the variables other than $D'$ and $\mathit{WIN}$, $\mathit{WIN}$ has the same value
in context $u$ in both $M$ and $M'$. Clearly if $A = 1$ or $D = 1$, then $\mathit{WIN} = 1$ in both $M$ and
$M'$. So suppose that we set $A = D = 0$. Now if $B = 1$, then $D' = 1$ (since $A = 0$), so again
$\mathit{WIN} = 1$ in both $M$ and $M'$. In contrast, if $B = 0$, then $D' = 0$, so $\mathit{WIN} = 0$ in both $M$
and $M'$. Condition CE clearly holds as well. As shown in Example 2.8.2, $B = 1 \land C = 1$ is
a cause of $\mathit{WIN} = 1$ in $(M', u)$.

Finally, consider the model $M''$ that is just like $M'$ except that it has one additional variable
$D''$, where $D'' = D \land \neg A$ and the equation for $\mathit{WIN}$ becomes $\mathit{WIN} = A \lor D' \lor D''$. All
the other equations in $M''$ are the same as those in $M'$. Define the normality ordering in $M''$
so that it respects the equations for both $D'$ and $D''$ relative to $u$.

It is easy to check that $M''$ is a conservative extension of $M'$. Since $D''$ does not affect
any variable but $\mathit{WIN}$, and all the equations except that for $\mathit{WIN}$ are unchanged, it suffices to
show that for all settings of the variables other than $D''$ and $\mathit{WIN}$, $\mathit{WIN}$ has the same value
in context $u$ in both $M'$ and $M''$. Clearly if $A = 1$ or $D' = 1$, then $\mathit{WIN} = 1$ in both $M'$ and
$M''$. And if $A = D' = 0$, then $D'' = 1$ iff $D = 1$, so again the value of $\mathit{WIN}$ is the same in
$M'$ and $M''$. Condition CE clearly holds as well.
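
Both sufficiency checks (for $M$ versus $M'$, and for $M'$ versus $M''$) are small enough to confirm by brute force. The sketch below assumes 0/1 values and uses hypothetical function names; `D1` and `D2` stand for $D'$ and $D''$. It checks only that the value of WIN agrees under every setting of the other shared variables, not condition CE on the normality orderings.

```python
from itertools import product


def win_M(A, B, D):
    # M: WIN = A or B or D
    return A | B | D


def win_M1(A, B, D):
    # M': D' follows its equation D' = B and not A; WIN = A or D' or D
    D1 = B & (1 - A)
    return A | D1 | D


def win_M1_given(A, D, D1):
    # M' with D' held at an arbitrary value: WIN = A or D' or D
    return A | D1 | D


def win_M2_given(A, D, D1):
    # M'': D'' follows its equation D'' = D and not A; WIN = A or D' or D''
    D2 = D & (1 - A)
    return A | D1 | D2


# Every setting of A, B, C, D (C does not enter the equation for WIN).
m_vs_m1 = all(
    win_M(A, B, D) == win_M1(A, B, D)
    for A, B, C, D in product((0, 1), repeat=4)
)

# Every setting of A, B, C, D, D'.
m1_vs_m2 = all(
    win_M1_given(A, D, D1) == win_M2_given(A, D, D1)
    for A, B, C, D, D1 in product((0, 1), repeat=5)
)
print(m_vs_m1, m1_vs_m2)
```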

Finally, I claim that $B = 1 \land C = 1$ is no longer a cause of $\mathit{WIN} = 1$ in $(M'', u)$
according to the updated HP definition. Suppose, by way of contradiction, that it is, with
witness $(\vec{W}, \vec{w}, \vec{x}')$. $A = 0$ must be a conjunct of $\vec{W} = \vec{w}$. It is easy to see that either $D' = 0$
is a conjunct of $\vec{W} = \vec{w}$ or $D' \notin \vec{W}$, and similarly for $D''$. Since $D' = D'' = 0$ in the context
$u$ and $(M'', u) \models [A \gets 0, D' \gets 0, D'' \gets 0](\mathit{WIN} = 0)$, it easily follows that AC2(b$^u$)
does not hold, regardless of whether $D'$ and $D''$ are in $\vec{W}$.
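
The key facts used here can be evaluated directly from the equations of $M''$. The following is a minimal sketch, assuming 0/1 values; `solve_M2` and its `do` dictionary are names introduced for illustration, with `D1` and `D2` standing for $D'$ and $D''$. It shows that once $A$, $D'$, and $D''$ are all held at 0, WIN is 0 even when $B$ and $C$ are kept at their actual value 1.

```python
def solve_M2(a, do=None):
    """Solve the equations of M'', with the exogenous variable setting A = a.

    `do` maps variable names to forced values (interventions).
    """
    do = do or {}
    v = {}
    v["A"] = do.get("A", a)
    v["B"] = do.get("B", v["A"])                           # B = A
    v["C"] = do.get("C", v["A"])                           # C = A
    v["D"] = do.get("D", v["B"] & v["C"])                  # D = B and C
    v["D1"] = do.get("D1", v["B"] & (1 - v["A"]))          # D' = B and not A
    v["D2"] = do.get("D2", v["D"] & (1 - v["A"]))          # D'' = D and not A
    v["WIN"] = do.get("WIN", v["A"] | v["D1"] | v["D2"])   # WIN = A or D' or D''
    return v


actual = solve_M2(1)  # in the actual context, D' = D'' = 0 and WIN = 1
blocked = solve_M2(1, {"A": 0, "D1": 0, "D2": 0})
blocked_with_cause = solve_M2(1, {"A": 0, "D1": 0, "D2": 0, "B": 1, "C": 1})
print(actual["D1"], actual["D2"], blocked["WIN"], blocked_with_cause["WIN"])
```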

Thus, $B = 1 \land C = 1$ goes from not being a cause of $\mathit{WIN} = 1$ in $(M, u)$ to being a cause
of $\mathit{WIN} = 1$ in $(M', u)$ to not being a cause of $\mathit{WIN} = 1$ in $(M'', u)$. ■


Notes 


Paul and Hall [2013], among many others, seem to take it for granted that causality should be 
objective, but they do express sympathy for what they call a “mixed approach” that “would be 
sensitive to the need to give a suitably objective account of the causal structure of the world, 
yet recognize that any account that can do justice to our causal intuitions will have to have 
some pragmatic features ...” [Paul and Hall 2013, p. 254]. Arguably, the HP approach is 
such a mixed approach. 

Most of the discussion in Sections 4.1–4.4 is taken from [Halpern 2014a]. The example 
regarding different causes of a traffic accident is a variant of an example originally due to 
Hanson [1958]. 

Issues of stability have been considered before. Strevens [2008] provides an example 
where what Strevens calls a cause can become a non-cause if extra variables are added ac- 
cording to Woodward’s [2003] definition of causality (actually, Strevens considered what 
Woodward called a contributing cause); Eberhardt [2014] shows that this can also happen 
for type causality using Woodward’s definition. However, [Halpern 2014a] seems to have 
been the first systematic investigation of stability. Huber [2013] defines causality using a 
ranking function on worlds and proposes a condition similar in spirit to that of a normality 
ordering respecting the equations (although he does not apply it in the context of stability). 

Example 4.1.1 is due to Spohn [personal communication, 2012]. Example 4.1.2 is due to 
Weslake [2015]. Example 4.1.3 is a slight simplification of an example due to Hall [2007]. 
(Hall [2007] also has variables C and F such that C = B and F = E; adding them does
not affect any of the discussion here (or in Hall’s paper).) Of the conclusion that D = 1 
is a cause of F = 1 in (M’,u), Hall [2007] says, “This result is plainly silly, and doesn’t 
look any less silly if you insist that causal claims must always be relativized to a model.” I 
disagree. To be more precise, I would argue that Hall has in mind a particular picture of the 
world: that captured by model M. Of course, if that is the “right” picture of the world, then
the conclusion that D = 1 is a cause of B = 1 is indeed plainly silly. But for the story with 
Xavier, it does not seem (to me) at all unreasonable that D = 1 should be a cause of B = 1. 

Example 4.1.4 is due to Glymour et al. [2010]; Example 4.1.5 is taken from [Halpern 
2015a]. Example 4.6.1 is a simplification of an example introduced by Bennett [1987]. Hitch- 
cock [2012] has an extended discussion of Example 4.6.1. 

Most of the discussion in Sections 4.5-4.7 is taken from [Halpern and Hitchcock 2010]. 
In particular, the question of how the variables and the ranges of variables in a causal model 
should be chosen is discussed there. Woodward [2016] picks up on the theme of variable 
choice and provides some candidate criteria, such as choosing variables that are well-defined 
targets for interventions and choosing variables that can be manipulated to any of their possi- 
ble values independently of the values taken by other variables. 

The issue of the set of possible values of a variable is related to discussions about the 
metaphysics of “events”. (Note that usage of the word “event” in the philosophy literature 
is different from the typical usage of the word in computer science and probability, where 
an event is just a subset of the state space. See the notes at the end of Chapter 2 for more 
discussion.) Suppose that the homeowner in Section 4.5 pushed on the door with enough 
force to open it. Some philosophers, such as Davidson [1967], have argued that this should 
be viewed as just one event, the push, that can be described at various levels of detail, such 
as a “push” or a “hard push”. Others, such as Kim [1973] and Lewis [1986c], have argued 
that there are many different events corresponding to these different descriptions. If we take 
the latter view, which of the many events that occur should be counted as causes of the door’s 
opening? In the HP definition, these questions must all be dealt with at the time that the 
variables describing a model and their ranges are chosen. The question then becomes “what 
choice of variables describes the effects of interventions in a way that is not misleading or
ambiguous”. Of course, whether a particular description does so is a matter that can be argued 
(by lawyers, among others). 

The fact that someone is not held liable for damages if he acts as the hypothetical “rea- 
sonable person” would have done in similar circumstances is discussed by Hart and Honoré 
[1985, pp. 142ff.]. They also discuss circumstances under which malicious acts of third par- 
ties are considered to be “abnormal” interventions and affect the assessment of causation (see 
[Hart and Honoré 1985, pp. 68]). 
Chapter 5


Complexity and Axiomatization 


The complexities of cause and effect defy analysis. 
Douglas Adams, Dirk Gently’s Holistic Detective Agency 


Is it plausible that people actually work with structural equations and (extended) causal mod- 
els to evaluate actual causation? People are cognitively limited. If we represent the structural 
equations and the normality ordering in what is perhaps the most obvious way, the models 
quickly become large and complicated, even with a small number of variables. 

To understand the problem, consider a doctor trying to deal with a patient who has just 
come in reporting bad headaches. Let’s keep things simple. Suppose that the doctor considers 
only a small number of variables that might be relevant: stress, constriction of blood vessels 
in the brain, aspirin consumption, and trauma to the head. Again, keeping things simple, 
assume that each of these variables (including headaches) is binary; that is, has only two 
possible values. So, for example, the patient either has a headache or not. Each variable may 
depend on the value of the other four. To represent the structural equation for the variable 
“headaches”, a causal model will need to assign a value to “headaches” for each of the sixteen 
possible values of the other four variables. That means that there are $2^{16}$ (over 60,000)
possible equations for “headaches”. Considering all five variables, there are $2^{80}$ (over $10^{24}$)
possible sets of equations. Now consider the normality ordering. With five binary variables,
there are $2^5 = 32$ possible assignments of values to these variables. Think of each of these
assignments as a “possible world”. There are $32!$ (roughly $2.6 \times 10^{35}$) strict orders of these
32 worlds, and many more if we allow for ties or incomparable worlds. Altogether, the doctor 
would need to store close to two hundred bits of information just to represent this simple 
extended causal model. 

Now suppose we consider a more realistic model with 50 random variables. Then the
same arguments show that there are as many as $2^{50 \times 2^{49}}$ possible sets of equations, $2^{50}$
possible worlds, and roughly $2^{50 \times 2^{50}}$ normality orderings (in general, with $n$ binary variables,
there are $2^{n 2^{n-1}}$ sets of equations, $2^n$ possible worlds, and $(2^n)! \sim 2^{n 2^n}$ strict orders). Thus,
with 50 variables, roughly $50 \times 2^{50}$ bits would be needed to represent a causal model. This
is clearly cognitively unrealistic.
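
The counts in the last two paragraphs can be checked mechanically. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes, as in the text, that every variable is binary and may depend on all the others.

```python
import math

def equation_bits(n: int) -> int:
    """Bits needed to tabulate all n structural equations when each binary
    variable may depend on the other n - 1 variables: one output bit per
    variable for each of the 2**(n - 1) settings of the others."""
    return n * 2 ** (n - 1)

def strict_order_bits(n: int) -> float:
    """log2((2**n)!): bits needed to pick one strict order on the 2**n worlds.
    lgamma(m + 1) equals ln(m!), which avoids building the huge factorial."""
    return math.lgamma(2 ** n + 1) / math.log(2)

# Five binary variables (the headache example).
print(2 ** 16)                      # possible equations for "headaches"
print(equation_bits(5))             # exponent of 2 for the sets of equations
print(f"{math.factorial(32):.2e}")  # strict orders on the 32 worlds
print(round(equation_bits(5) + strict_order_bits(5)))  # total bits

# Fifty binary variables: compare log2((2**50)!) with 50 * 2**50.
print(strict_order_bits(50) / (50 * 2 ** 50))
```

The last ratio comes out slightly below 1, which is the sense in which $(2^n)!$ is of the same order as $2^{n 2^n}$.
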


If the account of actual causation given in the preceding chapters is to be at all realistic
as a model of human causal judgment, some form of compact representation will be needed. 
Fortunately, as I show in this chapter, in practice, things are likely to be much better; it is 
quite reasonable to expect that most “natural” models can be represented compactly. At a 
minimum, we cannot use these concerns about compact representations to argue that people 
cannot be using (something like) causal models to reason about causality. 

But even given a compact representation, how hard is it to actually compute whether X = x
is a cause of Y = y? It is not hard to exhaustively check all the possibilities if there are
relatively few variables involved or in structures with a great deal of symmetry. This has been 
the case for all the examples I have considered thus far. But, as we shall see in Chapter 8, 
actual causality can also be of great use in, for example, reasoning about the correctness 
of programs and in answering database queries. Large programs and databases may well 
have many variables, so now the question of the complexity of determining actual causality 
becomes a significant issue. Computer science provides tools to formally characterize the 
complexity of determining actual causation; I discuss these in Section 5.3. 

In many cases, we do not have a completely specified causal model. Rather, what we know 
are certain facts about the model: for example, we might know that, although X is actually 
1, if X were 0, then Y would be 0. The question is what conclusions about causality we can 
draw from such information. In Section 5.4, I discuss axioms for causal reasoning. 

Note to the reader: the four sections in this chapter can be read independently. The material 
in this chapter is more technical than that of the other chapters in this book; I have deferred 
the most technical to Section 5.5. The rest of the material should be accessible even to readers 
with relatively little mathematical background. 


5.1 Compact Representations of Structural Equations 


The calculations above on the complexity of representing structural equations just gave naive
upper bounds. Suppose that whether the patient has a headache is represented by the variable
H, and the patient has a headache iff any of the conditions $X_1, \ldots, X_k$ holds. Rather than
painfully writing out the $2^k$ possible settings of the variables $X_1, \ldots, X_k$ and the corresponding
setting of H, we can simply write one line: $H = X_1 \lor \ldots \lor X_k$. Indeed, as long as all
variables are binary, we can always describe the effect of $X_1, \ldots, X_k$ on H by a single propositional
formula. However, in general, this formula will have a length that is exponential in $k$;
we cannot count on it having a short description as in this example. So, in the worst case, we
have either $2^k$ short equations or one long equation. Can anything be done to mitigate this?
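
To make the contrast concrete, here is a minimal Python sketch of the two representations of the headache equation. The function names and the choice k = 4 are assumptions made for illustration.

```python
from itertools import product

def headache_table(k: int) -> dict:
    """Explicit representation: one entry for each of the 2**k settings
    of X_1, ..., X_k, giving the corresponding value of H."""
    return {xs: int(any(xs)) for xs in product((0, 1), repeat=k)}

def headache_formula(xs) -> int:
    """Compact representation: H = X_1 or ... or X_k, written once."""
    return int(any(xs))

k = 4
table = headache_table(k)
print(len(table))                                        # number of rows: 2**k
print(all(table[xs] == headache_formula(xs) for xs in table))
```

The table grows exponentially in k, while the formula has the same length for any k. This works only because disjunction happens to have a short description; an arbitrary Boolean function of k arguments need not.
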

The first step toward the goal of getting a compact representation comes from the observation
that similar representational difficulties arise when it comes to reasoning about probability.
For example, if a doctor would like to reason probabilistically about a situation where
there are 50 binary variables describing symptoms and potential diseases that a patient might
have, just describing a probability distribution on the $2^{50}$ possible worlds would also require
$2^{50}$ (more precisely, $2^{50} - 1$) numbers. Bayesian networks have been used to provide compact
representations of probability distributions; they exploit (conditional) independencies
between variables. As I mentioned in the notes to Chapter 2, the causal networks that we have
been using to describe causal models are similar in spirit to Bayesian networks. This similarity
is quite suggestive; the same ideas that allow Bayesian networks to provide compact
representations of probability distributions can be applied in the setting of causal models.


Although a thorough discussion of Bayesian networks is well beyond the scope of this 
book, it is easy to see where independence comes in, both in the probabilistic setting and in 
the setting of causality. If $X_1, \ldots, X_n$ are independent binary variables, then we can represent
a probability distribution on the $2^n$ possible worlds generated by these variables using only
$n$ numbers rather than $2^n$ numbers. Specifically, the values $\Pr(X_i = 0)$, $i = 1, \ldots, n$,
completely determine the distribution. From them, we can compute $\Pr(X_i = 1)$ for $i = 1, \ldots, n$.
The probability $\Pr(X_1 = i_1 \land \ldots \land X_n = i_n)$ for $i_j \in \{0, 1\}$, $j = 1, \ldots, n$, is then
determined, thanks to the assumption that $X_1, \ldots, X_n$ are independent.
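
As a small illustration of this point, the following Python sketch rebuilds the full distribution on $2^n$ worlds from the $n$ numbers $\Pr(X_i = 0)$. The function name and the particular probabilities are hypothetical.

```python
from itertools import product
from math import prod

def joint_from_marginals(p_zero):
    """Build the joint distribution on {0,1}**n from the n numbers
    p_zero[i] = Pr(X_i = 0), assuming X_1, ..., X_n are independent."""
    joint = {}
    for world in product((0, 1), repeat=len(p_zero)):
        # Independence: the probability of a world is the product of the
        # probabilities of its components.
        joint[world] = prod(p if x == 0 else 1 - p
                            for x, p in zip(world, p_zero))
    return joint

p_zero = [0.9, 0.5, 0.2]            # hypothetical values of Pr(X_i = 0)
joint = joint_from_marginals(p_zero)
print(len(joint))                   # 2**3 worlds from 3 numbers
print(joint[(0, 1, 1)])             # 0.9 * 0.5 * 0.8
print(sum(joint.values()))          # sums to 1 up to rounding
```
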


Of course, it is rarely the case that we are considering a situation described by n variables 
that are all independent of each other. Most situations of interest involve variables that are 
somehow related to each other. However, it is still the case that there are often some variables 
that are (conditionally) independent of others. It turns out that if no variable depends on 
too many other variables, then we can get a significantly more compact representation of 
the distribution. Similarly, if the structural equations are such that each variable depends 
on the values of only a few other variables, then we can again get a significantly simpler 
representation of the equations. 


Consider the model $M'_{RT}$ for the rock-throwing example again. The variable BS is affected
by BT; if ST = 0 (Suzy doesn’t throw), then BS = BT. However, the effect of BT on BS
is mediated by BH. Once we know the value of BH, the value of BT is irrelevant. We can
think of BS as being independent of BT, conditional on BH. Similarly, BS is independent
of ST given SH. So, although BS is affected by all of ST, BT, BH, and SH, its value is
completely determined by SH and BH. This is made clear by the equation that we used for
BS: $BS = SH \lor BH$.


A causal network describes these conditional independencies. The key point is that the
value of a variable in the graph is completely determined by the values of its parents in the
graph. As long as each variable has at most $k$ parents in the graph, we can describe the structural
equations using $2^k n$ equations in the worst case, rather than $2^{n-1} n$ equations. (Needless
to say, the connection to Bayesian networks is not purely coincidental!) Again, if all variables
are binary, we can effectively combine the $2^k n$ or $2^{n-1} n$ equations into one formula, but in the worst
case, the length of the formula will be comparable to the total length of the short equations
corresponding to each setting.
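
The following Python sketch shows what "determined by its parents" buys in the rock-throwing example. Only the equation BS = SH ∨ BH is quoted in this section; the equations SH = ST and BH = BT ∧ ¬SH, the treatment of ST and BT as fixed by the context, and all function names are assumptions made for the sake of the illustration.

```python
# Each variable lists its parents and a function of the parents' values only.
MODEL = {
    "ST": ((), lambda: 1),                                    # assumed: set by context
    "BT": ((), lambda: 1),                                    # assumed: set by context
    "SH": (("ST",), lambda st: st),                           # assumed equation
    "BH": (("BT", "SH"), lambda bt, sh: int(bt and not sh)),  # assumed equation
    "BS": (("SH", "BH"), lambda sh, bh: int(sh or bh)),       # quoted in the text
}

def solve(model, interventions=None):
    """Evaluate variables in the listed (topological) order; an intervention
    replaces a variable's equation with a fixed value."""
    interventions = interventions or {}
    values = {}
    for var, (parents, f) in model.items():
        if var in interventions:
            values[var] = interventions[var]
        else:
            values[var] = f(*(values[p] for p in parents))
    return values

def table_rows(model):
    """Rows needed if each equation is tabulated over its parents only."""
    return sum(2 ** len(parents) for parents, _ in model.values())

print(solve(MODEL))                 # both throw: Suzy's rock hits, Billy's does not
print(solve(MODEL, {"ST": 0}))      # Suzy does not throw: Billy's rock hits
print(table_rows(MODEL), 5 * 2 ** 4)
```

With at most two parents per variable, tabulating the equations takes 12 rows here, against $n 2^{n-1} = 80$ rows when every variable is allowed to depend on all the others.
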


In practice, it often seems to be the case that the value of each variable in a causal network
is determined by the values of at most $k$ other variables, where $k$ is relatively small. And
when some variable does depend on quite a few other variables, we can sometimes get away
with ignoring many of these variables because their impact is relatively small. Alternatively,
we can hope that, if the value of a variable depends on that of many variables, the dependence
is a simple one, as in the case of the headache example above, so that it can be described by
a short equation, perhaps with a few special cases. To take just one example, consider voting.
The outcome of a vote may well depend on how each voter votes, and there can be many
voters. But voting rules can typically be described by relatively simple equations (majority
rules; if there are more than two candidates, someone must get more than 50% of the vote or
there is a runoff; no more than 20% can abstain; ...).

The bottom line here is that it seems likely that, just as with the probability distributions 
that arise in practice, the structural equations that arise in practice are likely to have compact 
representations. Of course, this is a statement that needs to be experimentally verified. But it is 
worth pointing out that in all the examples given in this book, the equations can be represented 
compactly. 

These compact representations have an added benefit. Recall that in Section 2.5, where I 
discussed adding probability to structural equations, I pointed out that it was useful to have 
a natural way to represent certain probabilistic facts. Bayesian networks go a long way to 
helping in this regard. However, they are not a complete solution. For example, there is no 
natural way to represent in a Bayesian network the fact that counterfactual outcomes such 
as “the bottle would have toppled over had only Suzy’s rock hit it” and “the bottle would 
have toppled over had only Billy’s rock hit it” are independent (see Example 2.5.4), using 
only the variables we have used for the rock-throwing example so far. However, if we add 
variables SO and BO to the model that characterize whether the bottle would have toppled 
over if Suzy (resp., Billy) had hit it, and take these variables to be determined by the context, 
then we can easily say that these two outcomes are independent. Even with the variables 
used in Example 2.5.4, we can say, for example, that Billy throwing and Suzy throwing are 
independent events. Of course, it is not at all clear that these variables are independent; it is 
easy to imagine that Suzy throwing can influence Billy throwing, and vice versa, so it is more 
likely that both throw than that only one of them throws. The point is that using a Bayesian 
network, we can represent these probabilistic independencies. Not surprisingly, here too the 
choice of variables used to describe the model can make a big difference. 


5.2 Compact Representations of the Normality Ordering 


I now turn to the problem of representing a normality ordering compactly. There are two main 
ideas. We have already seen the first: taking advantage of independencies. Although we used 
a partial preorder $\succeq$ to represent the normality ordering, not probability, it turns out that the
“technology” of Bayesian networks can be applied far more broadly than just probability; we 
just need a structure that has a number of minimal properties and an analogue of (conditional) 
independence. Somewhat surprisingly perhaps, it turns out that partial preorders fit the bill. 
The second idea can be applied if the normality ordering largely reflects the causal structure 
in the sense that what is normal is what agrees with the equations. An example should make 
this point clearer: Suppose that the causal structure is such that if the patient suffers a head 
trauma, then he would also suffer from headaches. Then we would expect a world where the 
patient has a headache and a trauma to be more normal, all else being equal, than a world in 
which the patient has a trauma and does not have a headache. In this way, a representation of 
causal structure can do “double duty” by representing much of the normality order as well. 
In the remainder of this section, I discuss these ideas in more detail. I remark that although 
I talk about a partial preorder on worlds in this section, these ideas apply equally well to 
the alternate representation of normality that involves a partial preorder on contexts. Since a 
context $\vec{u}$ can be identified with the world $s_{\vec{u}}$, a partial preorder on worlds induces a partial
preorder on contexts and vice versa (although not all worlds may be of the form $s_{\vec{u}}$ for some
context $\vec{u}$).


5.2.1 Algebraic plausibility measures: the big picture 


Probability measures are just one way of representing uncertainty; many other representations 
of uncertainty have been discussed in the literature. In particular, many of the representations 
of typicality and normality mentioned in the bibliographic notes for Chapter 3 can be viewed 
as representations of uncertainty. This includes partial preorders; one way of interpreting 
$s \succeq s'$ is that the world $s$ is at least as likely as the world $s'$. Plausibility measures are generalized
representations of uncertainty; probability and all the representations of uncertainty
mentioned in the notes of Chapter 3 are instances of plausibility measures. 

The basic idea behind plausibility measures is straightforward. A probability measure on
a finite set W of worlds maps subsets of W to [0,1]. A plausibility measure is more general;
it maps subsets of W to some arbitrary set D partially ordered by $\le$. If Pl is a plausibility
measure, then $\mathrm{Pl}(U)$ denotes the plausibility of U. If $\mathrm{Pl}(U) \le \mathrm{Pl}(V)$, then V is at least as
plausible as U. Because the order is partial, it could be that the plausibility of two different
sets is incomparable. An agent may not be prepared to order two sets in terms of plausibility.
D is assumed to contain two special elements, $\bot$ and $\top$, such that $\bot \le d \le \top$ for all $d \in D$.
We require that $\mathrm{Pl}(\emptyset) = \bot$ and $\mathrm{Pl}(W) = \top$. Thus, $\bot$ and $\top$ are the analogues of 0 and 1 for
probability. We further require that if $U \subseteq V$, then $\mathrm{Pl}(U) \le \mathrm{Pl}(V)$. This seems reasonable;
a superset of U should be at least as plausible as U. Since we want to talk about conditional
independence here, we will need to deal with what have been called conditional plausibility
measures (cpms), not just plausibility measures. A conditional plausibility measure maps
pairs of subsets of W to some partially ordered set D. I write $\mathrm{Pl}(U \mid V)$ rather than $\mathrm{Pl}(U, V)$,
in keeping with standard notation for conditioning. I typically write just $\mathrm{Pl}(U)$ rather than
$\mathrm{Pl}(U \mid W)$ (so unconditional plausibility is identified with conditioning on the whole space).

In the case of a probability measure Pr, it is standard to take Pr(U | V) to be undefined 
if Pr(V) = 0. In general, we must make precise what the allowable second arguments of a 
cpm are. For simplicity here, I assume that $\mathrm{Pl}(U \mid V)$ is defined as long as $V \ne \emptyset$. For each
fixed $V \ne \emptyset$, $\mathrm{Pl}(\cdot \mid V)$ is required to be a plausibility measure on W. More generally, the
following properties are required: 


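CPl1. $\mathrm{Pl}(\emptyset \mid V) = \bot$.

CPl2. $\mathrm{Pl}(W \mid V) = \top$.

CPl3. If $U \subseteq U'$, then $\mathrm{Pl}(U \mid V) \le \mathrm{Pl}(U' \mid V)$.

CPl4. $\mathrm{Pl}(U \mid V) = \mathrm{Pl}(U \cap V \mid V)$.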


CPl1 and CPl2 just say that the empty set has plausibility $\bot$ and the whole space has plausibility
$\top$, conditional on any set V. These are the obvious analogues of the requirements for
conditional probability that the empty set has probability 0 and the whole space probability
1, conditional on any set. CPl3 just says that the plausibility of a superset cannot be less than
the plausibility of a subset. (Note that this means that these plausibilities are comparable; in
general, the plausibilities of two sets may be incomparable.) Finally, CPl4 just says that the
plausibility of U conditional on V depends only on the part of U in V.
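
As a sanity check on the four properties, the following Python sketch verifies by brute force that ordinary conditional probability on a three-world space is a conditional plausibility measure, with 0 and 1 playing the roles of $\bot$ and $\top$. The world names, the weights, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

W = frozenset({"a", "b", "c"})
weight = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}

def pr(U, V):
    """Pr(U | V) for nonempty V (every world has positive weight here)."""
    return sum(weight[w] for w in U & V) / sum(weight[w] for w in V)

def subsets(S):
    items = sorted(S)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def satisfies_cpl(pl, bottom, top):
    """Check CPl1-CPl4 for pl over all U and all nonempty V."""
    for V in subsets(W):
        if not V:
            continue
        if pl(frozenset(), V) != bottom or pl(W, V) != top:   # CPl1, CPl2
            return False
        for U in subsets(W):
            if pl(U, V) != pl(U & V, V):                      # CPl4
                return False
            for U2 in subsets(W):
                if U <= U2 and not pl(U, V) <= pl(U2, V):     # CPl3
                    return False
    return True

print(satisfies_cpl(pr, Fraction(0), Fraction(1)))
```

A measure that ignores its second argument, such as the fraction of worlds in U, passes CPl1 to CPl3 but fails CPl4, so the check is not vacuous.
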

There are some other obvious properties we might want to require, such as generalizations 
of CPl1 and CPl2 that say that $\mathrm{Pl}(U \mid V) = \bot$ if $U \cap V = \emptyset$ and $\mathrm{Pl}(U \mid V) = \top$ if $V \subseteq U$.
In the case of conditional probability, the analogues of these properties follow from the fact
that $\Pr(V_1 \cup V_2 \mid V) = \Pr(V_1 \mid V) + \Pr(V_2 \mid V)$ if $V_1$ and $V_2$ are disjoint sets. Another
important property of conditional probability is that $\Pr(U \mid V') = \Pr(U \mid V) \times \Pr(V \mid V')$
if $U \subseteq V \subseteq V'$. These properties turn out to play a critical role in guaranteeing compact
representations of probability distributions, so we will need plausibilistic analogues of them. 
This means that we need to have plausibilistic analogues of addition and multiplication so that 
we can take 


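$\mathrm{Pl}(V_1 \cup V_2 \mid V_3) = \mathrm{Pl}(V_1 \mid V_3) \oplus \mathrm{Pl}(V_2 \mid V_3)$ if $V_1$ and $V_2$ are disjoint,


and $\mathrm{Pl}(V_1 \mid V_3) = \mathrm{Pl}(V_1 \mid V_2) \otimes \mathrm{Pl}(V_2 \mid V_3)$ if $V_1 \subseteq V_2 \subseteq V_3$.    (5.1)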


A cpm that satisfies these properties is said to be an algebraic cpm. (I give a more formal 
definition in Section 5.5.1.) Of course, for probability, (5.1) holds if we take $\oplus$ and $\otimes$ to be
+ and ×, respectively. I mentioned in the notes to Chapter 3 that another approach to giving
semantics to normality and typicality uses what are called ranking functions; (5.1) holds for
ranking functions if we take $\oplus$ and $\otimes$ to be min and +, respectively.
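
The claim about ranking functions can be checked by brute force on a small example. The sketch below assumes the usual definition of a conditional rank, $\kappa(U \mid V) = \kappa(U \cap V) - \kappa(V)$ with $\kappa(U)$ the minimum rank of a world in U, which is not spelled out in this section; the ranks and the names are hypothetical.

```python
import math
from itertools import combinations

rank = {"w1": 0, "w2": 1, "w3": 3}   # hypothetical ranks; lower is more normal
W = frozenset(rank)

def kappa(U, V=W):
    """Conditional rank kappa(U | V) = kappa(U & V) - kappa(V), V nonempty.
    The empty set gets rank infinity."""
    def k(S):
        return min((rank[w] for w in S), default=math.inf)
    return k(U & V) - k(V)

def subsets(S):
    items = sorted(S)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def check_5_1():
    """Check both parts of (5.1) with oplus = min and otimes = +."""
    for V3 in subsets(W):
        if not V3:
            continue
        for V1 in subsets(V3):
            for V2 in subsets(V3):
                # Disjoint union: ranks combine by min.
                if not (V1 & V2) and \
                        kappa(V1 | V2, V3) != min(kappa(V1, V3), kappa(V2, V3)):
                    return False
                # Chain V1 <= V2 <= V3 (V2 nonempty): ranks combine by +.
                if V2 and V1 <= V2 and \
                        kappa(V1, V3) != kappa(V1, V2) + kappa(V2, V3):
                    return False
    return True

print(check_5_1())
```
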

Both the original and alternative approach to incorporating normality are based on a partial 
preorder. In the original approach, it is a partial preorder on worlds, whereas in the alternative 
approach, it is a partial preorder on contexts, which is lifted to a partial preorder on sets of 
contexts. We would like to be able to view these partial preorders as algebraic cpms. There 
is a problem though. A plausibility measure attaches a plausibility to sets of worlds, not 
single worlds. This is compatible with the alternative approach, for which an ordering on 
events was critical. Indeed, when defining the alternative approach, I showed how a partial 
preorder $\succeq$ on contexts can be extended to a partial preorder $\succeq^e$ on sets of contexts. This
order on sets essentially defines a plausibility measure. But I need even more here. I want not 
just an arbitrary plausibility measure but an algebraic cpm. Getting an appropriate algebraic 
cpm involves some technical details, so I defer the construction to Appendix 5.5.1. For the 
purposes of this section, it suffices to just accept that this can be done; specifically, assume 
that a partial preorder on worlds (or contexts) can be extended to an algebraic cpm defined 
on sets of worlds (or contexts). The key point is that, once we do this, we can talk about 
the independence and conditional independence of endogenous variables, just as in the case 
of probability. Moreover, it means that the normality ordering can be represented using the 
analogue of a causal network, where again a node labeled by X is conditionally independent 
of all its ancestors in the graph given its parents (just as BS is independent of BT given 
BH). This will allow us to apply the “technology” of Bayesian networks to the normality 
ordering almost without change. Although the graph will, in general, be different from the 
graph representing the (structural equations of) the causal model, in many cases, there will 
be substantial overlap between the two. As discussed in Section 5.2.2, this allows for even 
greater representational economy. 

Although this is helpful, in a sense, it is the “wrong” result. This result says that if we have 
a partial preorder on worlds, it can be represented compactly. But the real question is how do 
agents come up with a normality ordering in the first place? To understand the point, consider 
a lawyer arguing that a defendant should be convicted of arson. The lawyer will attempt to
establish a claim of actual causation: that the defendant’s action of lighting a match was an 
actual cause of the forest fire. To do this, she will need to convince the jury that a certain 
extended causal model is correct and that certain initial conditions obtained (for example, that 
the defendant did indeed light a match). To justify the causal model, she will need to defend 
the equations. This might involve convincing the jury that the defendant’s match was the sort 
of thing that could cause a forest fire (the wood was dry), and that there would have been no 
fire in the absence of some triggering event, such as a lightning strike or an act of arson. 

The lawyer will also have to defend a normality ordering. To do this, she might argue that 
a lightning strike could not have been reasonably foreseen, that lighting a fire in the forest 
at that time of year was in violation of a statute, and that it was to be expected that a forest 
fire would result from such an act. It will usually be easier to justify a normality ordering in a 
piecemeal fashion. Instead of arguing for a particular normality ordering on entire worlds, she 
argues that individual variables typically take certain values in certain situations. By defining 
the normality ordering for a variable (or for a variable conditional on other variables taking on 
certain values) and making some independence assumptions, we can completely characterize 
the normality ordering. 

The idea is that if a situation is characterized by the random variables $X_1, \ldots, X_n$, so that
a world has the form $(x_1, \ldots, x_n)$, then we assume that each world has a plausibility value of
the form $a_1 \otimes \cdots \otimes a_n$. For example, if $n = 3$, $X_1$ and $X_2$ are independent, and $X_3$ depends
on both $X_1$ and $X_2$, then a world (1, 0, 1) would be assigned a plausibility of $a_1 \otimes a_2 \otimes a_3$,
where $a_1$ is the unconditional plausibility of $X_1 = 1$, $a_2$ is the unconditional plausibility of
$X_2 = 0$, and $a_3$ is the plausibility of $X_3 = 1$ conditional on $X_1 = 1 \land X_2 = 0$. We do
not need to actually define the operation $\otimes$; we just leave $a_1 \otimes a_2 \otimes a_3$ as an uninterpreted
expression. However, if we have some constraints on the relative order of elements (as we do
in our example and typically will in practice), then we can lift this to an order on expressions
of the form $a_1 \otimes a_2 \otimes a_3$ by taking $a_1 \otimes a_2 \otimes a_3 \le a'_1 \otimes a'_2 \otimes a'_3$ if and only if, for all $a_i$,
there is a distinct $a'_j$ such that $a_i \le a'_j$. The “only if” builds in a minimality assumption:
two elements $a_1 \otimes a_2 \otimes a_3$ and $a'_1 \otimes a'_2 \otimes a'_3$ are incomparable unless they are forced to be
comparable by the ordering relations among the $a_i$s and the $a'_j$s. One advantage of using a
partial preorder, rather than a total preorder, is that we can do this.

Although this may all seem rather mysterious, the application of these ideas to specific 
cases is often quite intuitive. The following two examples show how this construction works 
in our running example. 


Example 5.2.1 Consider the forest-fire example again. We can represent the independencies 
in the forest-fire example using the causal network given in Figure 5.1 (where the exogenous 
variable is omitted): L and MD are independent, and FF depends on both of them. Thus,
we do indeed have a compact network describing the relevant (in)dependencies. Note that the 
same network represents the (in)dependencies in both the conjunctive and disjunctive models. 


[Figure 5.1: The forest-fire example (a causal network over MD, L, and FF).]


For the remainder of this example, for definiteness, consider the disjunctive model, where
either a lightning strike or an arsonist’s match suffices for fire. It would be natural to say
that lightning strikes and arson attacks are atypical, and that a forest fire typically occurs if
either of these events occurs. Suppose that we use $d_L^+$ to represent the plausibility of L = 0
(lightning not occurring) and $d_L^-$ to represent the plausibility of L = 1; similarly, we use $d_{MD}^+$
to represent the plausibility of MD = 0 and $d_{MD}^-$ to represent the plausibility of MD = 1.
Now the question is what we should take the conditional plausibility of FF = 0 and FF = 1
to be, given each of the four possible settings of L and MD. For simplicity, take all the four
values compatible with the equations to be equally plausible, with plausibility $d_{FF}^+$, and the
four values incompatible with the equations to be equally plausible, with plausibility $d_{FF}^-$. Then
using Pl to represent the plausibility measure, we have:


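$\mathrm{Pl}(L = 0) = d_L^+ > d_L^- = \mathrm{Pl}(L = 1)$
$\mathrm{Pl}(MD = 0) = d_{MD}^+ > d_{MD}^- = \mathrm{Pl}(MD = 1)$
$\mathrm{Pl}(FF = 0 \mid L = 0 \land MD = 0) = d_{FF}^+ > d_{FF}^- = \mathrm{Pl}(FF = 1 \mid L = 0 \land MD = 0)$
$\mathrm{Pl}(FF = 1 \mid L = 1 \land MD = 0) = d_{FF}^+ > d_{FF}^- = \mathrm{Pl}(FF = 0 \mid L = 1 \land MD = 0)$
$\mathrm{Pl}(FF = 1 \mid L = 0 \land MD = 1) = d_{FF}^+ > d_{FF}^- = \mathrm{Pl}(FF = 0 \mid L = 0 \land MD = 1)$
$\mathrm{Pl}(FF = 1 \mid L = 1 \land MD = 1) = d_{FF}^+ > d_{FF}^- = \mathrm{Pl}(FF = 0 \mid L = 1 \land MD = 1)$    (5.2)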


Suppose that we further assume that $d_L^+$, $d_{MD}^+$, and $d_{FF}^+$ are all incomparable, as are $d_L^-$, $d_{MD}^-$,
and $d_{FF}^-$. Thus, for example, we cannot compare the degree of typicality of no lightning with
that of no arson attacks or the degree of atypicality of lightning with that of an arson attack.
Using the construction discussed gives us the ordering on worlds given in Figure 5.2, where
an arrow from $w$ to $w'$ indicates that $w' \succ w$.

In this normality ordering, (0, 1, 1) is more normal than (1, 1, 1) and (0, 1, 0) but incomparable
with (1, 0, 0) and (0, 0, 1). That is because, according to our construction, (0, 1, 1),
(1, 1, 1), (0, 1, 0), (1, 0, 0), and (0, 0, 1) have plausibility $d_L^+ \otimes d_{MD}^- \otimes d_{FF}^+$, $d_L^- \otimes d_{MD}^- \otimes d_{FF}^+$,
$d_L^+ \otimes d_{MD}^- \otimes d_{FF}^-$, $d_L^- \otimes d_{MD}^+ \otimes d_{FF}^-$, and $d_L^+ \otimes d_{MD}^+ \otimes d_{FF}^-$, respectively. The fact that
$d_L^+ \otimes d_{MD}^- \otimes d_{FF}^+ \ge d_L^- \otimes d_{MD}^- \otimes d_{FF}^+$ follows because $d_L^+ \ge d_L^-$. The fact that we have
$>$, not just $\ge$, follows from the fact that we do not have $d_L^- \otimes d_{MD}^- \otimes d_{FF}^+ \ge d_L^+ \otimes d_{MD}^- \otimes d_{FF}^+$;
it does not follow from our condition for comparability. The other comparisons follow from
similar arguments. For example, the fact that (0, 1, 1) is incomparable with (1, 0, 0) follows
because neither $d_L^+ \otimes d_{MD}^- \otimes d_{FF}^+ \le d_L^- \otimes d_{MD}^+ \otimes d_{FF}^-$ nor $d_L^- \otimes d_{MD}^+ \otimes d_{FF}^- \le d_L^+ \otimes d_{MD}^- \otimes d_{FF}^+$
follows from our conditions. ∎
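
The construction used in this example can be sketched in a few lines of Python. The encoding is an assumption made for illustration: each basic plausibility value is a pair such as ("L", "+"), the only comparabilities are $d_X^- \le d_X^+$ for each variable X, and one expression is below another iff its factors can be matched one-to-one with larger factors of the other. Worlds are triples (L, MD, FF) in the disjunctive model.

```python
from itertools import permutations

def base_leq(a, b):
    """Order on basic values: d_X^- <= d_X^+ for each variable X, every
    value <= itself, and nothing else is comparable."""
    return a == b or (a[0] == b[0] and a[1] == "-" and b[1] == "+")

def lifted_leq(xs, ys):
    """xs <= ys iff the factors of xs can be matched one-to-one with
    factors of ys so that each factor of xs is <= its partner."""
    return any(all(base_leq(x, y) for x, y in zip(xs, perm))
               for perm in permutations(ys))

def plausibility(world):
    """Factors of the plausibility of a world (L, MD, FF)."""
    l, md, ff = world
    return (("L", "+" if l == 0 else "-"),
            ("MD", "+" if md == 0 else "-"),
            # FF agrees with the disjunctive equation iff FF = max(L, MD).
            ("FF", "+" if ff == max(l, md) else "-"))

def compare(w1, w2):
    """Return '>', '<', '=' or 'incomparable' for the normality of w1 vs w2."""
    ge = lifted_leq(plausibility(w2), plausibility(w1))
    le = lifted_leq(plausibility(w1), plausibility(w2))
    if ge and le:
        return "="
    return ">" if ge else "<" if le else "incomparable"

for other in [(1, 1, 1), (0, 1, 0), (1, 0, 0), (0, 0, 1)]:
    print((0, 1, 1), compare((0, 1, 1), other), other)
```

Trying all permutations is exponential in the number of variables; it is used here only because the example has three factors per world.
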


[Figure 5.2: A normality order on worlds.]


Example 5.2.1 shows how the “Bayesian network technology” can be used to compute
all the relevant plausibilities (and thus the normality ordering). To go from the qualitative
representation of (in)dependencies in terms of a causal network to a more quantitative
representation, we need the conditional plausibility of the value of each variable given each
possible setting of its parents. For variables that have no parents, this amounts to giving the
unconditional plausibility of each of the values of the variable. This is exactly what is done
on the first two lines of Table (5.2) for the variables L and MD, which have no parents in the
causal network (other than the exogenous variable). The next four lines of Table (5.2) give the
plausibility of the two values of FF, conditional on the four possible settings of its parents in
the network, L and MD. As is shown in the rest of the discussion in the example, these values
suffice to compute all the relevant plausibilities.

Although the savings in this case are not so great, the savings can be quite significant if we
have larger models. If there are $n$ binary variables, and each one has at most $k$ parents, rather
than describing the plausibility of the $2^n$ worlds, we need to give, for each variable, at most
$2^{k+1}$ plausibilities (the plausibility of each value of the variable conditional on each of the $2^k$
possible settings of values for its parents). Thus, we need at most $n 2^{k+1}$ plausibilities. Of
course, this is not much of a savings if $k$ is close to $n$, but if $k$ is relatively small and $n$ is
large, then $n 2^{k+1}$ is much smaller than $2^n$. For example, if $k = 5$ and $n = 100$ (quite realistic
numbers for a medical example; indeed, $n$ may be even larger), then $n 2^{k+1} = 6{,}400$, while
$2^n = 2^{100} > 10^{30}$. In general, we have to consider all the ways of ordering these 6,400 (or
$2^{100}$) plausibility values, but as the discussion in the example shows, the additional
assumptions on the ordering of expressions of the form $a_1 \otimes \cdots \otimes a_n$ make this straightforward (and
easy to represent) as well.
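
The arithmetic behind this comparison is easy to reproduce. The sketch below simply evaluates the two counts from the paragraph above; the function names are illustrative.

```python
def table_entries(n, k):
    """Upper bound on conditional plausibilities: n variables, at most k binary parents each."""
    return n * 2 ** (k + 1)


def world_count(n):
    """Number of worlds over n binary variables."""
    return 2 ** n


n, k = 100, 5
print(table_entries(n, k))   # entries needed with the network representation
print(world_count(n))        # worlds that would otherwise have to be ordered
```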


Example 5.2.2. The preorder on worlds induced by the Bayesian network in the previous 
example treats the lightning and the arsonist’s actions as incomparable. For example, the 
world (1,0, 1), where lightning strikes, the arsonist doesn’t, and there is a fire, is incomparable 
with the world where lightning doesn’t strike, the arsonist lights his match, and the fire occurs. 
But this is not the only possibility. Suppose we judge that it would be more atypical for the 
arsonist to light a fire than for lightning to strike and also more typical for the arsonist not to 
light a fire than for lightning not to strike. (Unlike the case of probability, the latter does not 
follow from the former.) Recall that this order might reflect the fact that arson is illegal and
immoral, rather than the frequency of occurrence of arson as opposed to lightning. Although
Table (5.2) still describes the conditional plausibility relations, we now have $d_{MD}^+ > d_L^+$ and
$d_L^- > d_{MD}^-$. This gives us the ordering on worlds described in Figure 5.3.


Figure 5.3: A different normality ordering on worlds.


Now, for example, the world (1, 0, 1) is strictly more normal than the world (0, 1, 1);
again, the latter has plausibility $d_L^+ \otimes d_{MD}^- \otimes d_{FF}^+$, whereas the former has plausibility
$d_L^- \otimes d_{MD}^+ \otimes d_{FF}^+$. But since $d_{MD}^+ > d_L^+$ and $d_L^- > d_{MD}^-$, by assumption, it follows that
$d_L^- \otimes d_{MD}^+ \otimes d_{FF}^+ > d_L^+ \otimes d_{MD}^- \otimes d_{FF}^+$. ∎


5.2.2. Piggy-backing on the causal model 


As I noted earlier, the graph representing the normality ordering can, in general, be different 
from that representing the causal model. Nonetheless, in many cases there will be substan- 
tial agreement between them. When this happens, it will be possible to make parts of the 
causal model do “double duty”: representing both the causal structure and the structure of the 
normality ordering. In Examples 5.2.1 and 5.2.2, the graph describing the causal structure 
is the same as that representing the normality ordering. We assumed that L and MD were 
independent with regard to both the structural equations and normality, but that FF depended
on both L and MD. In the case of the structural equations, the independence of L and MD
means that, once we set all other variables, a change in the value of L cannot affect the value
of MD, and vice versa. In the case of normality, it means that the degree of normality of a 
world where lightning occurred (resp., did not occur) does not affect the degree of normality 
of a world where the arsonist dropped a match (resp., did not drop a match), and vice versa. 
But we can say something more “quantitative”, beyond just the qualitative statement of 
independence. Consider the entries for the conditional plausibility of variable FF from Table
(5.2), which can be summarized as follows:

\[
\mathit{Pl}(FF = f\!f \mid L = l \land MD = m) =
\begin{cases}
d_{FF}^+ & \text{if } f\!f = \max(l, m) \\
d_{FF}^- & \text{otherwise.}
\end{cases}
\]

Recall that FF = max(L, MD) is the structural equation for FF in the causal model. So
the conditional plausibility table says, in effect, that it is typical for FF to obey the structural
equation and atypical to violate it.

Variables typically obey the structural equations. Thus, it is often far more efficient to 
assume that this holds by default, and explicitly enumerate cases where this is not so, rather 
than writing out all the equations. Consider the following default rule. 


Default Rule 1 (Normal Causality): Let $X$ be a variable in a causal model with
no exogenous parents, and let $\overrightarrow{PA}(X)$ be the vector of parents of $X$. Let the
structural equation for $X$ be $X = F_X(\overrightarrow{PA}(X))$. Then, unless explicitly given
otherwise, there are two plausibility values $d_X^+$ and $d_X^-$, with $d_X^+ > d_X^-$, such that

\[
\mathit{Pl}(X = x \mid \overrightarrow{PA}(X) = \overrightarrow{pa}) =
\begin{cases}
d_X^+ & \text{if } x = F_X(\overrightarrow{pa}) \\
d_X^- & \text{otherwise.}
\end{cases}
\]

Default Rule 1 says that it is typical for variables to satisfy the equations unless we explicitly
stipulate otherwise. Moreover, it says that, by default, all values of variables that satisfy the 
equations are equally typical, whereas all those that do not satisfy the equations are equally 
atypical. In Examples 5.2.1 and 5.2.2, FF satisfies Default Rule 1. Of course, we could 
allow some deviations from the equations to be more atypical than others; this would be a 
violation of the default rule. As the name suggests, the default rule is to be assumed unless 
explicitly stated otherwise. The hope is that there will be relatively few violations, so there is 
still substantial representational economy in assuming the rule. That is, the hope is that, once 
a causal model is given, the normality ordering can be represented efficiently by providing 
the conditional plausibility tables for only those variables that violate the default rule, or 
whose plausibility values are not determined by the default rule (because they have exogenous 
parents). It may, of course, be possible to formulate more complex versions of Default Rule 1
that accommodate exogenous parents and allow for more than two default values. The rule as 
stated seems like a reasonable compromise between simplicity and general applicability. 
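
To make the representational economy concrete, the following minimal sketch generates the conditional plausibility table that Default Rule 1 implies for a variable directly from its structural equation. The labels "+" and "-" stand for the two default plausibility values of the variable; the function name `default_cpt` and the dictionary layout are illustrative assumptions.

```python
from itertools import product


def default_cpt(equation, parent_ranges, value_range):
    """Label each (parent setting, value) pair '+' if the value obeys the equation, else '-'."""
    table = {}
    for parents in product(*parent_ranges):
        for x in value_range:
            table[(parents, x)] = "+" if x == equation(*parents) else "-"
    return table


# FF = max(L, MD), with L and MD binary, as in the forest-fire example.
ff_table = default_cpt(max, [(0, 1), (0, 1)], (0, 1))
for (parents, value), label in sorted(ff_table.items()):
    print(f"Pl(FF = {value} | L = {parents[0]}, MD = {parents[1]}) = d_FF^{label}")
```

Only variables that deviate from the default would then need an explicitly written table.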

The Normal Causality rule, by itself, does not tell us how the plausibility values for one 
variable compare to the plausibility values for another variable. Default Rule 1 can be supple- 
mented with a second rule: 


Default Rule 2 (Minimality): If $d_x$ and $d_y$ are plausibility values for distinct
variables $X$ and $Y$, and no information is given explicitly regarding the relative
orders of $d_x$ and $d_y$, then $d_x$ and $d_y$ are incomparable.


Again, this default rule is assumed to hold only if there is no explicit stipulation to the con- 
trary. Default Rule 2 says that the normality ordering among possible worlds should not 
include any comparisons that do not follow from the equations (via Default Rule 1) together 
with the information that is explicitly given. Interestingly, in the context of probability, a 
distribution that maximizes entropy subject to some constraints is the one that (very roughly)
makes things “as equal as possible” subject to the constraints. If there are no constraints, it 
reduces to the classic principle of indifference, which assigns equal probability to different 
possibilities in the absence of any reason to think some are more probable. In the context 
of plausibility, where only a partial order is assigned, it is possible to push this idea a step 
further by making the possibilities incomparable. In Example 5.2.1, all three variables satisfy 
Minimality. In Example 5.2.2, FF satisfies Minimality with respect to the other two variables,
but the variables L and MD do not satisfy it with respect to one another (since their values 
are stipulated to be comparable). 

With these two default rules, the extended causal model for the disjunctive model of the 
forest fire can be represented succinctly as follows: 


• FF = max(L, MD)
• Pl(L = 0) > Pl(L = 1)
• Pl(MD = 0) > Pl(MD = 1).


The rest of the structure of the normality ordering follows from the default rules. 
The extended causal model in Example 5.2.2 can be represented just using the following 
equations and inequalities: 


• FF = max(L, MD)
• Pl(MD = 0) > Pl(L = 0) > Pl(L = 1) > Pl(MD = 1).


Again, the rest of the structure follows from the default rules. In each case, the normality 
ordering among the eight possible worlds can be represented with the addition of just a few 
plausibility values to the causal model. Thus, moving from a causal model to an extended 
causal model need not impose enormous cognitive demands. 
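
As a rough illustration of how little needs to be written down, the sketch below starts from the succinct description of Example 5.2.2 (one equation and one chain of inequalities) and derives comparisons between worlds. It assumes the same pairing rule for comparing products of plausibility values as a stand-in for the construction referred to earlier; the labels and function names are hypothetical.

```python
from itertools import permutations, product


def transitive_closure(pairs):
    """Close a set of strict comparisons (higher, lower) under transitivity."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# Stipulated chain Pl(MD=0) > Pl(L=0) > Pl(L=1) > Pl(MD=1); FF+ > FF- by Default Rule 1.
STRICT = transitive_closure(
    {("MD+", "L+"), ("L+", "L-"), ("L-", "MD-"), ("FF+", "FF-")}
)


def factors(world):
    """Plausibility factors of a world (L, MD, FF) when FF = max(L, MD)."""
    l, md, ff = world
    return (
        "L+" if l == 0 else "L-",
        "MD+" if md == 0 else "MD-",
        "FF+" if ff == max(l, md) else "FF-",
    )


def product_geq(xs, ys):
    """Assumed rule: some one-to-one pairing of factors has x >= y throughout."""
    return any(
        all(x == y or (x, y) in STRICT for x, y in zip(xs, perm))
        for perm in permutations(ys)
    )


def strictly_more_normal(w1, w2):
    a, b = factors(w1), factors(w2)
    return product_geq(a, b) and not product_geq(b, a)


worlds = list(product((0, 1), repeat=3))
top = [w for w in worlds if all(strictly_more_normal(w, v) for v in worlds if v != w)]
print("most normal world(s):", top)
```

Because the plausibility values of L and MD are now comparable, some pairs of worlds that Minimality left incomparable become ordered.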

Exceptions to the default rules can come in many forms. There could be values of the 
variables for which violations of the equations are more typical than agreements with the 
equations. As was suggested after Default Rule 1, there could be multiple values of typicality, 
rather than just two for each variable. Or the conditional plausibility values of one variable 
could be comparable with those of another variable. These default rules are useful to the 
extent that there are relatively few violations of them. For some settings, other default rules 
may also be useful; the two rules above are certainly not the only possible useful defaults. 

Given these examples, the reader may wonder whether the causal structure is always iden- 
tical to the normality ordering. It is easy to see that there are counterexamples to this. Recall 
the Knobe and Fraser experiment from Section 3.1, in which Professor Smith and the admin- 
istrative assistant took the two remaining pens. Suppose that we modify the example slightly 
by assuming that the department chair has the option of instituting a policy forbidding faculty 
members from taking pens. Further suppose that Professor Smith will ignore the policy even 
if it is instituted and take a pen if he needs it. We can model this story using the binary
variables CP (which is 1 if the chairman institutes the policy and 0 otherwise), PS (which is
1 if Professor Smith takes the pen and 0 otherwise), and PO (which is 1 if a problem occurs
and 0 otherwise). Now, by assumption, as far as the causal structure goes, PS is independent
of CP: Professor Smith takes the pen regardless of whether there is a policy. However, as far 
as the normality ordering goes, PS does depend on CP. 
5.3 The Complexity of Determining Causality


In this section I discuss the computational complexity of determining causality. It seems 
clear that the modified definition is conceptually simpler than the original and updated HP 
definition. This conceptual simplicity has a technical analogue: in a precise sense, it is simpler 
to determine whether $\vec{X} = \vec{x}$ is a cause of $\varphi$ according to the modified definition than it is for either
the original or updated HP definitions. 

To make this precise, I need to review some basic notions of complexity theory. Formally, 
we are interested in describing the difficulty of determining whether a given $x$ is an element
of a set $\mathcal{L}$ (sometimes called a language). The language of interest here is

\[
\mathcal{L}_{cause} = \{ (M, \vec{u}, \varphi, \vec{X}, \vec{x}) : \vec{X} = \vec{x} \text{ is a cause of } \varphi \text{ in } (M, \vec{u}) \}.
\]

We will be interested in the difficulty of determining whether a particular tuple is in $\mathcal{L}_{cause}$.

How do we characterize the difficulty of determining whether a particular tuple is in
$\mathcal{L}_{cause}$? We expect the problem of determining whether a tuple $(M, \vec{u}, \varphi, \vec{X}, \vec{x})$ is in $\mathcal{L}_{cause}$
to be harder if the tuple is larger; intuitively, the problem should be harder for larger causal
models $M$ than for smaller models. The question is how much harder. The complexity class
P (polynomial time) consists of all languages $\mathcal{L}$ such that determining if $x$ is in $\mathcal{L}$ can be done
in time that is a polynomial function of the size of $x$ (where the size of $x$ is just the length of
$x$, viewed as a string of symbols). Typically, languages for which membership can be
determined in polynomial time are considered tractable. Similarly, PSPACE (polynomial space)
consists of all languages for which membership can be determined in space polynomial in the
input size, and EXPTIME consists of all languages for which membership can be determined
in time exponential in the input size.

The class NP consists of those languages for which membership can be determined in 
non-deterministic polynomial time. A language $\mathcal{L}$ is in NP if a program that makes some
guesses can determine membership in $\mathcal{L}$ in polynomial time. The classic language in NP is the
language consisting of all satisfiable propositional formulas (i.e., formulas satisfied by some 
truth assignment). We can determine whether a formula is satisfiable by guessing a satisfying 
truth assignment; after guessing a truth assignment, it is easy to check in polynomial time that 
the formula is indeed satisfied by that truth assignment. 

The class co-NP consists of all languages whose complement is in NP. For example, the 
language consisting of all unsatisfiable propositional formulas is in co-NP because its com- 
plement, the set of satisfiable propositional formulas, is in NP. Similarly, the set of valid 
propositional formulas is in co-NP. 

It can be shown that the following is an equivalent definition of NP: a language $\mathcal{L}$ is in
NP iff there is a polynomial-time algorithm $A$ that takes two arguments, $x$ and $y$, such that
$x \in \mathcal{L}$ iff there exists a $y$ such that $A(x, y) = 1$. We can think of $y$ as representing the
“guess”. For example, if $\mathcal{L}$ is the language of satisfiable propositional formulas, $x$ represents
a propositional formula, $y$ is a truth assignment, and $A(x, y) = 1$ if the truth assignment $y$
satisfies $x$. It is easy to see that $x$ is satisfiable iff there is some $y$ such that $A(x, y) = 1$.
Moreover, $A$ is certainly polynomial time. Similarly, a language $\mathcal{L}$ is in co-NP iff there is a
polynomial-time algorithm $A$ that takes two arguments, $x$ and $y$, such that $x \in \mathcal{L}$ iff, for all
$y$, we have $A(x, y) = 1$. For validity, we can again take $x$ to be a formula and $y$ to be a truth
assignment. Now $x$ is valid iff $A(x, y) = 1$ for all truth assignments $y$.
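
The verifier view of NP and co-NP can be illustrated with a small sketch. It assumes, purely for convenience, that formulas are given in conjunctive normal form as lists of clauses of signed integers (a positive integer is a variable, a negative one its negation); that encoding and the helper names are assumptions of the example. The function `A` is the polynomial-time check; `satisfiable` and `valid` wrap it in a brute-force existential or universal search, which is of course exponential.

```python
from itertools import product


def A(formula, assignment):
    """Polynomial-time check: return 1 if the assignment satisfies the CNF formula, else 0."""
    return int(
        all(any(assignment[abs(lit)] == (lit > 0) for lit in clause) for clause in formula)
    )


def all_assignments(formula):
    """Enumerate every truth assignment to the variables occurring in the formula."""
    names = sorted({abs(lit) for clause in formula for lit in clause})
    for bits in product([False, True], repeat=len(names)):
        yield dict(zip(names, bits))


def satisfiable(formula):
    """NP-style condition: there exists y with A(x, y) = 1."""
    return any(A(formula, y) == 1 for y in all_assignments(formula))


def valid(formula):
    """co-NP-style condition: for all y, A(x, y) = 1."""
    return all(A(formula, y) == 1 for y in all_assignments(formula))


print(satisfiable([[1, -2], [2]]))   # (p1 or not p2) and p2
print(valid([[1, -1]]))              # p1 or not p1
```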
The polynomial hierarchy is a hierarchy of complexity classes that generalize NP and co-NP.
We take $\Sigma_1^P$ to be NP. $\Sigma_2^P$ consists of all languages $\mathcal{L}$ for which there exists a
polynomial-time algorithm $A$ that takes three arguments such that $x \in \mathcal{L}$ iff there exists $y_1$ such that for
all $y_2$ we have $A(x, y_1, y_2) = 1$. To get $\Sigma_3^P$, we consider algorithms that take four arguments
and use one more alternation of quantifiers (there exists $y_1$ such that for all $y_2$ there exists $y_3$).
We can continue alternating quantifiers in this way to define $\Sigma_k^P$ for all $k$. We can similarly
define $\Pi_k^P$, this time starting the alternation with a universal quantifier (for all $y_1$, there exists
$y_2$, such that for all $y_3$ ...). Finally, the set $D_k^P$ is defined as follows:

\[
D_k^P = \{ \mathcal{L} : \exists \mathcal{L}_1, \mathcal{L}_2 : \mathcal{L}_1 \in \Sigma_k^P,\ \mathcal{L}_2 \in \Pi_k^P,\ \mathcal{L} = \mathcal{L}_1 \cap \mathcal{L}_2 \}.
\]

That is, $D_k^P$ consists of all languages (sets) that can be written as the intersection of a language
in $\Sigma_k^P$ and a language in $\Pi_k^P$.
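
The quantifier alternation in these definitions can be spelled out in code. The sketch below is a brute-force illustration only: it takes the certificates $y_1$ and $y_2$ to be fixed-length bit tuples and enumerates all of them, which takes exponential time. The names `sigma2`, `pi2`, and `in_d2` and the toy checkers are assumptions of the example.

```python
from itertools import product


def bit_tuples(n):
    """All bit tuples of length n, playing the role of candidate certificates."""
    return list(product((0, 1), repeat=n))


def sigma2(check, x, n1, n2):
    """Sigma_2-style condition: exists y1 such that for all y2, check(x, y1, y2) == 1."""
    return any(all(check(x, y1, y2) == 1 for y2 in bit_tuples(n2)) for y1 in bit_tuples(n1))


def pi2(check, x, n1, n2):
    """Pi_2-style condition: for all y1 there exists y2 with check(x, y1, y2) == 1."""
    return all(any(check(x, y1, y2) == 1 for y2 in bit_tuples(n2)) for y1 in bit_tuples(n1))


def in_d2(check_sigma, check_pi, x, n1, n2):
    """D_2-style condition: x lies in the intersection of the two languages."""
    return sigma2(check_sigma, x, n1, n2) and pi2(check_pi, x, n1, n2)


# Toy checkers that ignore x: the first holds when y1 is 1 or y2 is 0, the second when y1 == y2.
dominates = lambda x, y1, y2: int(y1[0] == 1 or y2[0] == 0)
matches = lambda x, y1, y2: int(y1[0] == y2[0])

print(sigma2(dominates, None, 1, 1), sigma2(matches, None, 1, 1))
print(pi2(matches, None, 1, 1), in_d2(dominates, matches, None, 1, 1))
```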

We can think of NP as consisting of all languages $\mathcal{L}$ such that if $x \in \mathcal{L}$, then there is a
short proof demonstrating this fact (although the proof may be hard to find). Similarly, co-NP
consists of those languages $\mathcal{L}$ such that if $x \notin \mathcal{L}$, then there is a short proof demonstrating
this fact. Thus, $D_1^P$ consists of all languages $\mathcal{L}$ such that for all $x$, there is a short proof of
whichever of $x \in \mathcal{L}$ and $x \notin \mathcal{L}$ is true. $D_k^P$ generalizes these ideas to higher levels of the
polynomial hierarchy. Classifying a language $\mathcal{L}$ in terms of which complexity class it lies in
helps us get an idea of how difficult it is to determine membership in $\mathcal{L}$. It is remarkable how
many naturally defined languages can be characterized in terms of these complexity classes.

Note that if a complexity class $C_1$ is a subset of complexity class $C_2$, then every language
in $C_1$ is in $C_2$; hence, $C_2$ represents problems that are at least as hard as $C_1$. It is not hard to
show that

\[
P \subseteq NP\ (= \Sigma_1^P) \subseteq \ldots \subseteq \Sigma_k^P \subseteq \ldots \subseteq PSPACE \subseteq EXPTIME.
\]

It is known that P $\ne$ EXPTIME (so that, in general, the problem of determining set membership
for a language in EXPTIME is strictly harder than the corresponding problem for a language
in P), but it is not known whether any of the other inclusions is strict (although it is conjectured
that they all are).

We can get a similar chain of inequalities by replacing NP and $\Sigma_k^P$ by co-NP and $\Pi_k^P$.
Finally, it is not hard to show that

\[
\Sigma_k^P \cup \Pi_k^P \subseteq D_k^P \subseteq \Sigma_{k+1}^P \cap \Pi_{k+1}^P.
\]

For example, to see that $\Sigma_k^P \subseteq D_k^P$, take $\mathcal{L}_2$ in the definition of $D_k^P$ to be the universal set,
so that $\mathcal{L}_1 \cap \mathcal{L}_2 = \mathcal{L}_1$. (The argument that $\Pi_k^P \subseteq D_k^P$ is essentially the same.) Thus, $D_k^P$
represents problems that are at least as hard as each of $\Sigma_k^P$ and $\Pi_k^P$ but no harder than either
of $\Sigma_{k+1}^P$ and $\Pi_{k+1}^P$. It is perhaps worth noting that although, by definition, a language is in
$D_k^P$ if it can be written as the intersection of languages in $\Sigma_k^P$ and $\Pi_k^P$, this is very different
from saying that $D_k^P = \Sigma_k^P \cap \Pi_k^P$.

We need one more set of definitions: a language $\mathcal{L}$ is said to be hard with respect to a
complexity class $C$ (e.g., NP-hard, PSPACE-hard, etc.) if every language in $C$ can be efficiently
reduced to $\mathcal{L}$; that is, for every language $\mathcal{L}'$ in $C$, an algorithm deciding membership in $\mathcal{L}'$
can be easily obtained from an algorithm for deciding membership in $\mathcal{L}$. More precisely, this
means that there is a polynomial-time computable function $f$ such that $x \in \mathcal{L}'$ iff $f(x) \in \mathcal{L}$.
Thus, given an algorithm $F_{\mathcal{L}}$ for deciding membership in $\mathcal{L}$, we can decide whether $x \in \mathcal{L}'$
by running $F_{\mathcal{L}}$ on input $f(x)$. Finally, a language is said to be complete with respect to a
complexity class $C$, or $C$-complete, if it is both in $C$ and $C$-hard. It is well known that Sat, the
language consisting of all satisfiable propositional formulas, is NP-complete. It easily follows
that its complement $\overline{\mathrm{Sat}}$, which consists of all propositional formulas that are not satisfiable
(i.e., negations of formulas that are valid), is co-NP-complete.

Although the complexity classes in the polynomial hierarchy are somewhat exotic, they 
are exactly what we need to characterize the complexity of determining causality. The exact 
complexity depends on the definition used. The problem is easiest with the modified
definition; roughly speaking, this is because we do not have to deal with AC2(b$^o$) or AC2(b$^u$).
Since, with the original definition, causes are always single conjuncts, the original definition 
is easier to deal with than the updated definition. Again, this shouldn’t be too surprising, since 
there is less to check. 

Before making these claims precise, there is another issue that must be discussed. Recall
that when we talk about the complexity of determining whether $\vec{X} = \vec{x}$ is a cause of $\varphi$ in
$(M, \vec{u})$, formally, we consider how hard it is to decide whether the tuple $(M, \vec{u}, \varphi, \vec{X}, \vec{x})$ is in
the language $\mathcal{L}_{cause}$. In light of the discussion in Section 5.1, we need to be careful about
how $M$ is represented. If we represent $M$ naively, then its description can be exponential in
the signature $\mathcal{S}$ (in particular, exponential in the length of $\varphi$, the number of endogenous
variables in $\mathcal{S}$, and the cardinality of the range of these variables). It is not hard to see that all the
relevant calculations are no worse than exponential in the length of $\varphi$, the number of
endogenous variables, and the cardinality of their ranges, so, with a naive representation, the
calculation becomes polynomial in the size of the input. This makes the problem polynomial for
what are arguably the wrong reasons. The size of the input is masking the true complexity of the
problem. Here, I assume that the causal model can be described succinctly. In particular, I assume
that succinct expressions for equations are allowed in describing the model, as I have been doing
all along. Thus, for example, the equation for BS in the “sophisticated” model $M'_{RT}$ for the rock-throwing
example is given as BS = BH $\lor$ SH, rather than explicitly giving the value of BS for each
of the sixteen settings of BT, ST, BH, and SH. This is just a long-winded way of saying
that I allow (but do not require) models to be described succinctly. Moreover, the description
is such that it is easy to determine (i.e., in time polynomial in the description of the model),
for each context $\vec{u}$, the unique solution to the equations. Listing the equations in the “affects”
order determined by $\preceq_{\vec{u}}$ can help in this task; the value of each endogenous variable can then
be determined in sequence.
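
The last point can be sketched in a few lines: if the equations are listed in an order compatible with the “affects” order, a single pass computes the unique solution, and an intervention simply overrides the corresponding equation. In the sketch below, only the equation BS = BH $\lor$ SH is taken from the text above; the remaining equations (SH = ST, BH = BT $\land \lnot$SH) and the exogenous names `U_ST` and `U_BT` are assumptions made for the sake of a runnable illustration.

```python
def solve(equations, context, interventions=None):
    """Evaluate equations in the given order; intervened variables keep their set value."""
    interventions = interventions or {}
    values = dict(context)
    for name, equation in equations:
        values[name] = interventions[name] if name in interventions else equation(values)
    return values


# Equations listed so that every variable appears after the variables it depends on.
ROCK_THROWING = [
    ("ST", lambda v: v["U_ST"]),                      # Suzy throws (assumed equation)
    ("BT", lambda v: v["U_BT"]),                      # Billy throws (assumed equation)
    ("SH", lambda v: v["ST"]),                        # Suzy hits (assumed equation)
    ("BH", lambda v: int(v["BT"] and not v["SH"])),   # Billy hits (assumed equation)
    ("BS", lambda v: int(v["BH"] or v["SH"])),        # bottle shatters: BS = BH or SH
]

context = {"U_ST": 1, "U_BT": 1}
print(solve(ROCK_THROWING, context))
print(solve(ROCK_THROWING, context, {"ST": 0}))
```

Each call takes time linear in the number of equations, provided each equation itself can be evaluated quickly.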

With this background, I can state the relevant complexity results. 


Theorem 5.3.1 


(a) The complexity of determining whether $\vec{X} = \vec{x}$ is a cause of $\varphi$ in $(M, \vec{u})$ under the
original HP definition is $\Sigma_2^P$-complete.


(b) The complexity of determining whether $\vec{X} = \vec{x}$ is a cause of $\varphi$ in $(M, \vec{u})$ under the
updated HP definition is $D_2^P$-complete.


(c) The complexity of determining whether $\vec{X} = \vec{x}$ is a cause of $\varphi$ in $(M, \vec{u})$ under the
modified HP definition is $D_1^P$-complete.


The proofs of parts (a) and (b) of Theorem 5.3.1 are beyond the scope of this book (see 
the notes at the end of the chapter for references); the proof of part (c) can be found in 
Section 5.5.2. 

These results seem somewhat disconcerting. Even $D_1^P$-complete is considered intractable,
which suggests that it may be hard to determine whether $\vec{X} = \vec{x}$ is a cause of $\varphi$ in general.
There are some reasons to believe that the situation is not quite so bad. First, in practice, there 
may be quite a bit of structure in causal models that may make the computation much simpler. 
The kinds of models that we use to get the hardness results are not typical. Second, there are 
two important special cases that are not so complicated, as the following result shows. 


Theorem 5.3.2 


(a) The complexity of determining whether $X = x$ is a cause of $\varphi$ in $(M, \vec{u})$ under the
original HP definition is NP-complete in binary models (i.e., models where all variables
are binary).


(b) The complexity of determining whether $X = x$ is a cause of $\varphi$ in $(M, \vec{u})$ under the
modified HP definition is NP-complete.


As we have observed, the restriction to singleton causes is made without loss of generality 
for the original HP definition. Although the restriction to binary models does entail some loss 
of generality, binary models do arise frequently in practice; indeed, almost all the examples 
in this book have involved binary models. The restriction to singleton causes does entail 
a nontrivial loss of generality for the modified HP definition. As we have seen, in voting 
scenarios and in the disjunctive case of the forest-fire example, causes are not singletons. 
Nevertheless, many examples of interest do involve singleton causes. 

This may not seem like major progress. Historically, even NP-complete problems were 
considered intractable. However, recently, major advances have been made in finding algo- 
rithms that deal well with many NP-complete problems; there seems to be some hope on this 
front. 


5.4 Axiomatizing Causal Reasoning 


The goal of this section is to understand the properties of causality in greater detail. The 
technique that I use for doing so is to characterize causal reasoning axiomatically. Before 
doing this, I first review some standard definitions from logic. 

An axiom system AX consists of a collection of axioms and inference rules. An axiom 
is a formula (in some predetermined language $\mathcal{L}$), and an inference rule has the form “from
$\varphi_1, \ldots, \varphi_k$ infer $\psi$,” where $\varphi_1, \ldots, \varphi_k, \psi$ are formulas in $\mathcal{L}$. A proof in AX consists of a
sequence of formulas in $\mathcal{L}$, each of which is either an axiom in AX or follows by an application
of an inference rule. A proof is said to be a proof of the formula $\varphi$ if the last formula in the
proof is $\varphi$. The formula $\varphi$ is said to be provable in AX, written AX $\vdash \varphi$, if there is a proof
of $\varphi$ in AX; similarly, $\varphi$ is said to be consistent with AX if $\lnot\varphi$ is not provable in AX.
A formula $\varphi$ is valid in causal structure $M$, written $M \models \varphi$, if $(M, \vec{u}) \models \varphi$ in all contexts
$\vec{u}$. The formula $\varphi$ is valid in a set $\mathcal{T}$ of causal structures if $M \models \varphi$ for all $M \in \mathcal{T}$; $\varphi$ is
satisfiable in $\mathcal{T}$ if there is a causal structure $M \in \mathcal{T}$ and a context $\vec{u}$ such that $(M, \vec{u}) \models \varphi$.
Note that $\varphi$ is valid in $\mathcal{T}$ iff $\lnot\varphi$ is not satisfiable in $\mathcal{T}$.

The notions of soundness and completeness connect provability and validity. An axiom
system AX is said to be sound for a language $\mathcal{L}$ with respect to a set $\mathcal{T}$ of causal models if
every formula in $\mathcal{L}$ provable in AX is valid with respect to $\mathcal{T}$. AX is complete for $\mathcal{L}$ with
respect to $\mathcal{T}$ if every formula in $\mathcal{L}$ that is valid with respect to $\mathcal{T}$ is provable in AX.

The causal models and language that I consider here are parameterized by a signature 
$\mathcal{S} = (\mathcal{U}, \mathcal{V}, \mathcal{R})$. (Recall that $\mathcal{U}$ is the set of exogenous variables, $\mathcal{V}$ is the set of endogenous
variables, and for each variable $Y$, $\mathcal{R}(Y)$ is the range of $Y$.) I restrict $\mathcal{S}$ here to signatures
where $\mathcal{V}$ is finite and the range of each endogenous variable is finite. This restriction
certainly holds in all the examples that I consider in this book. For each signature $\mathcal{S}$, I consider
$\mathcal{M}_{rec}(\mathcal{S})$, the set of all recursive causal models with signature $\mathcal{S}$. The language that I consider
is $\mathcal{L}(\mathcal{S})$, the language of causal formulas defined in Section 2.2.1, where the variables are
restricted to those in $\mathcal{V}$, and the range of these variables is given by $\mathcal{R}$. $\mathcal{L}(\mathcal{S})$ is rich enough
to express actual causality in models in $\mathcal{M}_{rec}(\mathcal{S})$. More precisely, given $\vec{X}$, $\vec{x}$, $\varphi$, and $\mathcal{S}$, for
each variant of the HP definition, it is straightforward to construct a formula $\psi$ in this language
such that $(M, \vec{u}) \models \psi$ for a causal model $M \in \mathcal{M}_{rec}(\mathcal{S})$ if and only if $\vec{X} = \vec{x}$ is a cause of
$\varphi$ in $(M, \vec{u})$. (The existential quantification implicit in the statement of AC2(a) and AC2(a$^m$)
can be expressed using a disjunction because the set $\mathcal{V}$ of endogenous variables is assumed
to be finite; similarly, the universal quantification implicit in the statement of AC2(b$^o$) and
AC2(b$^u$) can be expressed using a conjunction.) Moreover, the language can express features
of models, such as the effect of interventions. Thus, we will be able to see what inferences 
can be made if we understand the effect of various interventions. For example, results like 
Propositions 2.4.3 and 2.4.4 that provide sufficient conditions for causality to be transitive 
follow from the axioms. 

To help characterize causal reasoning in $\mathcal{M}_{rec}(\mathcal{S})$, where $\mathcal{S} = (\mathcal{U}, \mathcal{V}, \mathcal{R})$, it is helpful to
define a formula $Y \rightsquigarrow Z$ for $Y, Z \in \mathcal{V}$, read “$Y$ affects $Z$”. Intuitively, this formula is true
in a model $M$ and context $\vec{u}$ if there is a setting of the variables other than $Y$ and $Z$ such that
changing the value of $Y$ changes the value of $Z$. Formally, taking $\vec{X} = \mathcal{V} - \{Y, Z\}$ (so that $\vec{X}$
consists of the endogenous variables other than $Y$ and $Z$), and taking $\mathcal{R}(\vec{X})$ to be the range
of the variables in $\vec{X}$ (so $\mathcal{R}(\vec{X}) = \times_{X \in \vec{X}} \mathcal{R}(X)$), $Y \rightsquigarrow Z$ is an abbreviation of

\[
\bigvee_{\vec{x} \in \mathcal{R}(\vec{X}),\ y, y' \in \mathcal{R}(Y),\ z, z' \in \mathcal{R}(Z),\ z \ne z'}
\bigl( [\vec{X} \leftarrow \vec{x}, Y \leftarrow y](Z = z) \land [\vec{X} \leftarrow \vec{x}, Y \leftarrow y'](Z = z') \bigr),
\]

which is a formula in $\mathcal{L}(\mathcal{S})$. (In statisticians’ notation, this says that $Y \rightsquigarrow Z$ is true in a
context $\vec{u}$ if there exist $\vec{x}$, $y$, $y'$, $z$, and $z'$ with $z \ne z'$ such that $Z_{\vec{x}y}(\vec{u}) = z \land Z_{\vec{x}y'}(\vec{u}) = z'$.)
The disjunction is really acting as an existential quantifier here; this is one place I am using
the fact that the range of each endogenous variable is finite. (With existential quantification
in the language, I would not need this assumption.)
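
Because all ranges are finite, the “affects” relation can be checked by exhaustive search, mirroring the disjunction above. The following is a minimal sketch for a recursive model given as an ordered list of equations; it uses the forest-fire equation FF = max(L, MD), while the exogenous names `U_L` and `U_MD`, the data layout, and the function names are assumptions of the example.

```python
from itertools import product


def solve(equations, context, interventions):
    """Evaluate equations in order; intervened variables keep their set value."""
    values = dict(context)
    for name, equation in equations:
        values[name] = interventions[name] if name in interventions else equation(values)
    return values


def affects(equations, ranges, context, y, z):
    """True if, for some setting of the other variables, changing y changes z."""
    others = [v for v in ranges if v not in (y, z)]
    for setting in product(*(ranges[v] for v in others)):
        fixed = dict(zip(others, setting))
        outcomes = {solve(equations, context, {**fixed, y: value})[z] for value in ranges[y]}
        if len(outcomes) > 1:   # two values of y yield different values of z
            return True
    return False


FOREST_FIRE = [
    ("L", lambda v: v["U_L"]),
    ("MD", lambda v: v["U_MD"]),
    ("FF", lambda v: max(v["L"], v["MD"])),
]
RANGES = {"L": (0, 1), "MD": (0, 1), "FF": (0, 1)}
context = {"U_L": 1, "U_MD": 0}

print(affects(FOREST_FIRE, RANGES, context, "L", "FF"))
print(affects(FOREST_FIRE, RANGES, context, "FF", "L"))
```

The search is exponential in the number of endogenous variables, in line with the size of the disjunction it implements.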

Consider the following axioms C0-C6 and rule of inference MP. (I translate the first four
axioms into statisticians’ notation, omitting the context $\vec{u}$ since they are intended to hold for all
contexts $\vec{u}$. The representation in this case is quite compact. However, standard statisticians’
notation is not so well suited for translating axiom C6, although of course it could be extended
to capture it.)


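C0. All substitution instances of propositional tautologies (see below).


C1. $[\vec{Y} \leftarrow \vec{y}](X = x) \Rightarrow [\vec{Y} \leftarrow \vec{y}](X \ne x')$ if $x, x' \in \mathcal{R}(X)$, $x \ne x'$ (equality)
(i.e., $X_{\vec{y}} = x \Rightarrow X_{\vec{y}} \ne x'$ if $x \ne x'$)


C2. $\bigvee_{x \in \mathcal{R}(X)} [\vec{Y} \leftarrow \vec{y}](X = x)$ (definiteness)
(i.e., $\bigvee_{x \in \mathcal{R}(X)} (X_{\vec{y}} = x)$)


C3. $([\vec{X} \leftarrow \vec{x}](W = w) \land [\vec{X} \leftarrow \vec{x}](Y = y)) \Rightarrow [\vec{X} \leftarrow \vec{x}, W \leftarrow w](Y = y)$ (composition)
(i.e., $(W_{\vec{x}} = w \land Y_{\vec{x}} = y) \Rightarrow (Y_{\vec{x}w} = y)$)


C4. $[X \leftarrow x, \vec{W} \leftarrow \vec{w}](X = x)$ (effectiveness)
(i.e., $X_{x\vec{w}} = x$)


C5. $(X_0 \rightsquigarrow X_1 \land \ldots \land X_{k-1} \rightsquigarrow X_k) \Rightarrow \lnot(X_k \rightsquigarrow X_0)$ if $X_k \ne X_0$ (recursiveness)


C6. (a) $[\vec{X} \leftarrow \vec{x}]\lnot\varphi \Leftrightarrow \lnot[\vec{X} \leftarrow \vec{x}]\varphi$
(b) $[\vec{X} \leftarrow \vec{x}](\varphi \land \psi) \Leftrightarrow ([\vec{X} \leftarrow \vec{x}]\varphi \land [\vec{X} \leftarrow \vec{x}]\psi)$ (determinism)
(c) $[\vec{X} \leftarrow \vec{x}](\varphi \lor \psi) \Leftrightarrow ([\vec{X} \leftarrow \vec{x}]\varphi \lor [\vec{X} \leftarrow \vec{x}]\psi)$


MP. From $\varphi$ and $\varphi \Rightarrow \psi$, infer $\psi$ (modus ponens)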


To explain the notion of substitution instance in C0, note that, for example, $p \lor \lnot p$ is a
propositional tautology; $[Y \leftarrow y](X = x) \lor \lnot[Y \leftarrow y](X = x)$ (i.e., $X_y = x \lor X_y \ne x$)
is a substitution instance of it, obtained by substituting $[Y \leftarrow y](X = x)$ for $p$. More
generally, by a substitution instance of a propositional tautology $\varphi$, I mean the result of uniformly
replacing all primitive propositions in $\varphi$ by arbitrary formulas of $\mathcal{L}(\mathcal{S})$. C1 just states an
obvious property of equality: if $X = x$ in the unique solution of the equations in context $\vec{u}$ of
model $M_{\vec{Y} \leftarrow \vec{y}}$, then we cannot have $X = x'$ if $x' \ne x$. C2 states that there is some value
$x \in \mathcal{R}(X)$ that is the value of $X$ in all solutions to the equations in context $\vec{u}$ in $M_{\vec{Y} \leftarrow \vec{y}}$. Note
that the statement of C2 makes use of the fact that $\mathcal{R}(X)$ is finite (for otherwise C2 would
involve an infinite disjunction). C3 says that if, after setting $\vec{X}$ to $\vec{x}$, $Y = y$ and $W = w$,
then $Y = y$ if we set both $\vec{X}$ to $\vec{x}$ and $W$ to $w$; this was proved in Lemma 2.10.2. C4
simply says that in all solutions obtained after setting $X$ to $x$, the value of $X$ is $x$. It is easy to
see that C5 holds in recursive models. For if $(M, \vec{u}) \models Y \rightsquigarrow Z$, then $Y \preceq_{\vec{u}} Z$. Thus, if
$(M, \vec{u}) \models X_0 \rightsquigarrow X_1 \land \ldots \land X_{k-1} \rightsquigarrow X_k$, then $X_0 \preceq_{\vec{u}} X_k$, so we cannot have $X_k \preceq_{\vec{u}} X_0$
if $X_0 \ne X_k$. Thus, $(M, \vec{u}) \models \lnot(X_k \rightsquigarrow X_0)$. Finally, the validity of C6 is almost immediate
from the semantics of the relevant formulas.

C5 can be viewed as a collection of axioms (actually, axiom schemes), one for each $k$. The
case $k = 1$ already gives us $\lnot(Y \rightsquigarrow Z) \lor \lnot(Z \rightsquigarrow Y)$ for all distinct variables $Y$ and $Z$. That
is, it tells us that, for any pair of variables, at most one affects the other. However, it can be
shown that just restricting C5 to the case of $k = 1$ does not suffice to characterize $\mathcal{M}_{rec}(\mathcal{S})$.

Let $AX_{rec}(\mathcal{S})$ consist of C0-C6 and MP. $AX_{rec}(\mathcal{S})$ characterizes causal reasoning in
recursive models, as the following theorem shows.

Theorem 5.4.1 $AX_{rec}(\mathcal{S})$ is a sound and complete axiomatization for the language $\mathcal{L}(\mathcal{S})$ in
$\mathcal{M}_{rec}(\mathcal{S})$.


Proof: See Section 5.5.3. ∎


We can also consider the complexity of determining whether a formula $\varphi$ is valid in
$\mathcal{M}_{rec}(\mathcal{S})$ (or, equivalently, is provable in $AX_{rec}(\mathcal{S})$). This depends in part on how we
formulate the problem.

One version of the problem is to consider a fixed signature S and ask how hard it is to 
decide whether a formula $\varphi \in \mathcal{L}(\mathcal{S})$ is valid. This turns out to be quite easy, for trivial
reasons. 


Theorem 5.4.2 If $\mathcal{S}$ is a fixed finite signature, the problem of deciding if a formula $\varphi \in \mathcal{L}(\mathcal{S})$
is valid in $\mathcal{M}_{rec}(\mathcal{S})$ can be solved in time linear in $|\varphi|$ (the length of $\varphi$ viewed as a string of
symbols).


Proof: If $\mathcal{S}$ is finite, there are only finitely many causal models in $\mathcal{M}_{\mathrm{rec}}(\mathcal{S})$, independent
of $\varphi$. Given $\varphi$, we can explicitly check whether $\varphi$ is valid in any (or all) of them. This can
be done in time linear in $|\varphi|$. Since $\mathcal{S}$ is not a parameter to the problem, the huge number of
possible causal models that we have to check affects only the constant. ∎
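The brute-force idea in this proof can be made concrete for a toy signature. The following Python sketch is only an illustration under stated assumptions: two binary endogenous variables named X and Y, equations that ignore the exogenous variables (so there is a single relevant context), and formulas represented as Python predicates on a solver. None of these names come from the text.

```python
from itertools import product

VALUES = (0, 1)
# A lookup table (t0, t1) gives a variable's value when its input is 0 or 1.
TABLES = list(product(VALUES, repeat=2))

def is_constant(table):
    return table[0] == table[1]

def all_recursive_models():
    """Yield (f_x, f_y): f_x reads Y's value, f_y reads X's value."""
    for f_x, f_y in product(TABLES, repeat=2):
        # With two variables, recursiveness means at least one of them
        # ignores the other, so there is no cycle of dependence.
        if is_constant(f_x) or is_constant(f_y):
            yield f_x, f_y

def solve(model, do):
    """Return the unique solution after the intervention `do` (a dict)."""
    f_x, f_y = model
    vals = dict(do)
    for _ in range(2):  # two passes suffice for two variables
        if "X" not in vals and (is_constant(f_x) or "Y" in vals):
            vals["X"] = f_x[vals.get("Y", 0)]
        if "Y" not in vals and (is_constant(f_y) or "X" in vals):
            vals["Y"] = f_y[vals.get("X", 0)]
    return vals

def is_valid(formula):
    """A formula is a predicate on a solver mapping interventions to solutions."""
    return all(formula(lambda do, m=m: solve(m, do))
               for m in all_recursive_models())

# [X <- 1](X = 1) is an instance of C4, so it should be valid.
c4_instance = lambda s: s({"X": 1})["X"] == 1
# [Y <- 1](X = 1) fails in any model where X is constantly 0.
not_valid = lambda s: s({"Y": 1})["X"] == 1

print(is_valid(c4_instance), is_valid(not_valid))
```

The number of models enumerated here depends only on the signature, not on the formula, which is the sense in which it contributes only a constant factor.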


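We can do even better than Theorem 5.4.2 suggests. Suppose that $\mathcal{V}$ consists of 100 variables
and $\varphi$ mentions only 3 of them. A causal model must specify the equations for all 100
variables. Is it really necessary to consider what happens to the 97 variables not mentioned
in $\varphi$ to decide whether $\varphi$ is satisfiable or valid? As the following result shows, we need to
check only the variables that appear in $\varphi$. Given a signature $\mathcal{S} = (\mathcal{U}, \mathcal{V}, \mathcal{R})$ and a context
$\vec{u} \in \mathcal{U}$, let $\mathcal{S}_{\varphi,\vec{u}} = (U^*, \mathcal{V}_\varphi, \mathcal{R}_{\varphi,\vec{u}})$, where $\mathcal{V}_\varphi$ consists of the variables in $\mathcal{V}$ that appear in $\varphi$,
$\mathcal{R}_{\varphi,\vec{u}}(U^*) = \{\vec{u}\}$, and $\mathcal{R}_{\varphi,\vec{u}}(Y) = \mathcal{R}(Y)$ for all variables $Y \in \mathcal{V}_\varphi$.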


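Theorem 5.4.3 A formula $\varphi \in \mathcal{L}(\mathcal{S})$ is satisfiable in $\mathcal{M}_{\mathrm{rec}}(\mathcal{S})$ iff there exists $\vec{u}$ such that $\varphi$
is satisfiable in $\mathcal{M}_{\mathrm{rec}}(\mathcal{S}_{\varphi,\vec{u}})$.


Proof: See Section 5.5.3. ∎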


Theorem 5.4.2 is the analogue of the observation that for propositional logic, the satisfia- 
bility problem is in linear time if we restrict to a fixed set of primitive propositions. The proof 
that the satisfiability problem for propositional logic is NP-complete implicitly assumes that 
we have an unbounded number of primitive propositions at our disposal. There are two ways 
to get an analogous result here. The first is to allow the signature S to be infinite; the second 
is to make the signature part of the input to the problem. The results in both cases are similar, 
so I just consider the case where the signature is part of the input here. 


Theorem 5.4.4 Given as input a pair $(\varphi, \mathcal{S})$, where $\varphi \in \mathcal{L}(\mathcal{S})$ and $\mathcal{S}$ is a finite signature,
the problem of deciding if $\varphi$ is satisfiable in $\mathcal{M}_{\mathrm{rec}}(\mathcal{S})$ is NP-complete.


Proof: See Section 5.5.3. ∎


Thus, the satisfiability problem for the language of causality is no harder than it is for
propositional logic. Since satisfiability is the dual of validity, the same is true for the validity 
problem (which is co-NP-complete). This can be viewed as good news. The techniques 
developed to test for satisfiability of propositional formulas can perhaps be applied to testing 
the satisfiability of causal formulas as well. 


5.5 Technical Details and Proofs 


In this section, I go through some of the technical details that I deferred earlier. 


5.5.1 Algebraic plausibility measures: the details 


In this section, I fill in the technical details of the construction of algebraic plausibility mea- 
sures. I start with the formal definition. 


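Definition 5.5.1 An algebraic conditional plausibility measure (algebraic cpm) Pl on $W$
maps pairs of subsets of $W$ to a domain $D$ that is endowed with operations $\oplus$ and $\otimes$, defined
on domains $\mathrm{Dom}(\oplus)$ and $\mathrm{Dom}(\otimes)$, respectively, such that the following properties hold:


Alg1. If $U_1$ and $U_2$ are disjoint subsets of $W$ and $V \ne \emptyset$, then
$\mathrm{Pl}(U_1 \cup U_2 \mid V) = \mathrm{Pl}(U_1 \mid V) \oplus \mathrm{Pl}(U_2 \mid V)$.


Alg2. If $U \subseteq V \subseteq V'$ and $V \ne \emptyset$, then $\mathrm{Pl}(U \mid V') = \mathrm{Pl}(U \mid V) \otimes \mathrm{Pl}(V \mid V')$.


Alg3. $\otimes$ distributes over $\oplus$; more precisely, $a \otimes (b_1 \oplus \cdots \oplus b_n) = (a \otimes b_1) \oplus \cdots \oplus (a \otimes b_n)$ if
$(a, b_1), \ldots, (a, b_n), (a, b_1 \oplus \cdots \oplus b_n) \in \mathrm{Dom}(\otimes)$ and $(b_1, \ldots, b_n), (a \otimes b_1, \ldots, a \otimes b_n) \in
\mathrm{Dom}(\oplus)$, where $\mathrm{Dom}(\oplus) = \{(\mathrm{Pl}(U_1 \mid V), \ldots, \mathrm{Pl}(U_n \mid V)) : U_1, \ldots, U_n$ are pairwise
disjoint and $V \ne \emptyset\}$ and $\mathrm{Dom}(\otimes) = \{(\mathrm{Pl}(U \mid V), \mathrm{Pl}(V \mid V')) : U \subseteq V \subseteq V', V \ne \emptyset\}$.
(The reason that this property is required only for tuples in $\mathrm{Dom}(\oplus)$ and $\mathrm{Dom}(\otimes)$ is
discussed shortly. Note that parentheses are not required in the expression $b_1 \oplus \cdots \oplus b_n$,
although, in general, $\oplus$ need not be associative. This is because it follows immediately
from Alg1 that $\oplus$ is associative and commutative on tuples in $\mathrm{Dom}(\oplus)$.)


Alg4. If $(a, c), (b, c) \in \mathrm{Dom}(\otimes)$, $a \otimes c \le b \otimes c$, and $c \ne \bot$, then $a \le b$. ∎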


The restrictions in Alg3 and Alg4 to tuples in $\mathrm{Dom}(\oplus)$ and $\mathrm{Dom}(\otimes)$ make these conditions a
little more awkward to state. It may seem more natural to consider a stronger version of, say, 
Alg4 that applies to all pairs in D x D. Requiring Alg3 and Alg4 to hold only for limited 
domains allows the notion of an algebraic plausibility measure to apply to a (usefully) larger 
set of plausibility measures. 

Roughly speaking, $\mathrm{Dom}(\oplus)$ and $\mathrm{Dom}(\otimes)$ are the only sets where we really care how $\oplus$
and $\otimes$ work. We use $\oplus$ to determine the (conditional) plausibility of the union of two disjoint
sets. Thus, we care about $a \oplus b$ only if $a$ and $b$ have the form $\mathrm{Pl}(U_1 \mid V)$ and $\mathrm{Pl}(U_2 \mid V)$,
respectively, where $U_1$ and $U_2$ are disjoint sets, in which case we want $a \oplus b$ to be
$\mathrm{Pl}(U_1 \cup U_2 \mid V)$. More generally, we care about $a_1 \oplus \cdots \oplus a_n$ only if $a_i$ has the form
$\mathrm{Pl}(U_i \mid V)$, where $U_1, \ldots, U_n$ are pairwise disjoint. $\mathrm{Dom}(\oplus)$ consists of precisely these tuples of
plausibility values. Similarly, we care about $a \otimes b$ only if $a$ and $b$ have the form $\mathrm{Pl}(U \mid V)$
and $\mathrm{Pl}(V \mid V')$, respectively, where $U \subseteq V \subseteq V'$, in which case we want $a \otimes b$ to be
$\mathrm{Pl}(U \mid V')$. $\mathrm{Dom}(\otimes)$ consists of precisely these pairs $(a, b)$. Restricting $\oplus$ and $\otimes$ to $\mathrm{Dom}(\oplus)$
and $\mathrm{Dom}(\otimes)$ will make it easier for us to view a partial preorder as an algebraic plausibility
measure. Since $\oplus$ and $\otimes$ are significant mainly to the extent that Alg1 and Alg2 hold, and Alg1
and Alg2 apply to tuples in $\mathrm{Dom}(\oplus)$ and $\mathrm{Dom}(\otimes)$, respectively, it does not seem unreasonable
that properties like Alg3 and Alg4 be required to hold only for these tuples.
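As a sanity check on Alg1 and Alg2, the following Python sketch treats an ordinary probability measure as a conditional plausibility measure, with $\oplus$ taken to be addition and $\otimes$ taken to be multiplication. The three worlds and their probabilities are hypothetical, chosen only so that every nonempty set has positive probability; the sketch illustrates the two properties and is not the construction developed in this section.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical probability on three worlds (all with positive probability).
PR = {"w1": Fraction(1, 2), "w2": Fraction(1, 3), "w3": Fraction(1, 6)}
SUBSETS = [frozenset(c) for r in range(len(PR) + 1)
           for c in combinations(sorted(PR), r)]

def pl(u, v):
    """Pl(U | V), here the conditional probability Pr(U | V); V must be nonempty."""
    return sum((PR[w] for w in u & v), Fraction(0)) / sum(PR[w] for w in v)

def check_alg1():
    """Alg1 with addition as the 'plus' operation, on disjoint U1 and U2."""
    return all(pl(u1 | u2, v) == pl(u1, v) + pl(u2, v)
               for u1 in SUBSETS for u2 in SUBSETS for v in SUBSETS
               if v and not (u1 & u2))

def check_alg2():
    """Alg2 with multiplication as the 'times' operation, on chains U <= V <= V'."""
    return all(pl(u, v2) == pl(u, v) * pl(v, v2)
               for u in SUBSETS for v in SUBSETS for v2 in SUBSETS
               if v and u <= v <= v2)

print(check_alg1(), check_alg2())
```

Both checks quantify only over the tuples that Alg1 and Alg2 mention, which mirrors the restriction of the two operations to their domains.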

In an algebraic cpm, we can define a set $U$ to be plausibilistically independent of $V$
conditional on $V'$ if $V \cap V' \ne \emptyset$ implies that $\mathrm{Pl}(U \mid V \cap V') = \mathrm{Pl}(U \mid V')$. The intuition here
is that learning $V$ does not affect the conditional plausibility of $U$ given $V'$. Note that
conditional independence is, in general, asymmetric. $U$ can be conditionally independent of $V$
without $V$ being conditionally independent of $U$. Although this may not look like the standard
definition of probabilistic conditional independence, it is not hard to show that this definition
agrees with the standard definition (that $\Pr(U \cap V \mid V') = \Pr(U \mid V') \times \Pr(V \mid V')$) in
the special case that the plausibility measure is actually a probability measure, provided that
$\Pr(V \cap V') \ne 0$. Of course, in this case, the definition is symmetric.
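The claimed agreement with the standard definition can be checked exhaustively on a small example. The sketch below assumes a uniform probability measure on four hypothetical worlds; the function names are made up for this illustration.

```python
from fractions import Fraction
from itertools import combinations

WORLDS = ("w1", "w2", "w3", "w4")
SUBSETS = [frozenset(c) for r in range(len(WORLDS) + 1)
           for c in combinations(WORLDS, r)]

def pr(u, v):
    """Pr(U | V) under the uniform measure; V must be nonempty."""
    return Fraction(len(u & v), len(v))

def plaus_independent(u, v, v2):
    """U is independent of V given V' in the sense defined in the text."""
    return not (v & v2) or pr(u, v & v2) == pr(u, v2)

def standard_independent(u, v, v2):
    """The usual product form of conditional independence."""
    return pr(u & v, v2) == pr(u, v2) * pr(v, v2)

# Compare the two definitions on every triple with V' nonempty.
agree = all(plaus_independent(u, v, v2) == standard_independent(u, v, v2)
            for u in SUBSETS for v in SUBSETS for v2 in SUBSETS if v2)
print(agree)
```

Because every world has positive probability here, the proviso $\Pr(V \cap V') \ne 0$ reduces to $V \cap V' \ne \emptyset$.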

The next step is to show how to represent a partial preorder on worlds as an algebraic 
plausibility measure. Given an extended causal model $M = (\mathcal{S}, \mathcal{F}, \succeq)$, define a preorder
$\succeq^e$ on subsets of $W$ just as in Section 3.5: $U \succeq^e V$ if, for all $w \in V$, there exists some
$w' \in U$ such that $w' \succeq w$. Recall that $\succeq^e$ extends the partial preorder $\succeq$ on worlds. We might
consider getting an unconditional plausibility measure $\mathrm{Pl}_{\succeq}$ that extends the partial preorder
on worlds by taking the range of $\mathrm{Pl}_{\succeq}$ to be subsets of $W$, defining $\mathrm{Pl}_{\succeq}$ as the identity (i.e.,
taking $\mathrm{Pl}_{\succeq}(U) = U$), and taking $U \ge V$ iff $U \succeq^e V$.

This almost works. There is a subtle problem though. The relation $\ge$ used in plausibility
measures must be a partial order, not a partial preorder. Recall that for $\ge$ to be a partial order
on a set $X$, if $x \ge x'$ and $x' \ge x$, we must have $x' = x$; this is not required for a partial
preorder. Thus, for example, if $w \succeq w'$ and $w' \succeq w$, then we want $\mathrm{Pl}_{\succeq}(\{w\}) = \mathrm{Pl}_{\succeq}(\{w'\})$.
This is easily arranged.

Define $U \equiv V$ if $U \succeq^e V$ and $V \succeq^e U$. Let $[U] = \{U' \subseteq W : U' \equiv U\}$; that is, $[U]$
consists of all sets equivalent to $U$ according to $\succeq^e$. We call $[U]$ the equivalence class of $U$.
Now define $\mathrm{Pl}_{\succeq}(U) = [U]$, and define an order $\ge$ on equivalence classes by taking $[U] \ge [V]$
iff $U' \succeq^e V'$ for some $U' \in [U]$ and $V' \in [V]$. It is easy to check that $\ge$ is well defined on
equivalence classes (since if $U' \succeq^e V'$ for some $U' \in [U]$ and $V' \in [V]$, then $U' \succeq^e V'$ for
all $U' \in [U]$ and $V' \in [V]$; that is, for all $w \in V'$, there must be some $w' \in U'$ such that $w' \succeq w$)
and is a partial order.

Although this gives us an unconditional plausibility measure extending $\succeq$, we are not quite
there yet. We need a conditional plausibility measure and a definition of $\oplus$ and $\otimes$. Note that
if $U_1, U_2 \in [U]$, then $U_1 \cup U_2 \in [U]$. Since $W$ is finite, it follows that each set $[U]$ has a
largest element, namely, the union of the sets in $[U]$.
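The lifting of a preorder on worlds to sets, the resulting equivalence classes, and their largest elements can be computed directly for a small example. In the Python sketch below, the three worlds and the preorder `GEQ` are hypothetical; worlds `a` and `b` are taken to be equally normal, and both are taken to be at least as normal as `c`.

```python
from itertools import combinations

WORLDS = ("a", "b", "c")
# Hypothetical preorder: (w1, w2) in GEQ means w1 is at least as normal as w2.
GEQ = {("a", "a"), ("b", "b"), ("c", "c"),
       ("a", "b"), ("b", "a"), ("a", "c"), ("b", "c")}
SUBSETS = [frozenset(c) for r in range(len(WORLDS) + 1)
           for c in combinations(WORLDS, r)]

def geq_e(u, v):
    """Lifted order: every world in V is dominated by some world in U."""
    return all(any((w1, w) in GEQ for w1 in u) for w in v)

def equiv_class(u):
    """All sets equivalent to U under the lifted order."""
    return frozenset(x for x in SUBSETS if geq_e(x, u) and geq_e(u, x))

def largest(u):
    """Union of all sets in [U]."""
    return frozenset().union(*equiv_class(u))

cls = equiv_class(frozenset({"a"}))
# Each class is closed under union, so its largest element is itself a member.
closed = all((x | y) in cls for x in cls for y in cls)
print(sorted(largest(frozenset({"a"}))), closed)
```

In this example every set containing `a` or `b` is equivalent to `{a}`, so the largest element of that class is the whole set of worlds.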

Let $D$ be the domain consisting of $\bot$, $\top$, and all elements of the form $d_{[U] \mid [V]}$ for all $[U]$
and $[V]$ such that the largest element in $[U]$ is a strict subset of the largest element in $[V]$.
Intuitively, $d_{[U] \mid [V]}$ represents the plausibility of $U$ conditional on $V$; the definition given
below enforces this intuition. Place an ordering $\ge$ on $D$ by taking $\bot < d_{[U] \mid [V]} < \top$ and
$d_{[U] \mid [V]} \le d_{[U'] \mid [V']}$ if $[V] = [V']$ and $U' \succeq^e U$. $D$ can be viewed as the range of an algebraic
plausibility measure, defined by taking


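$$\mathrm{Pl}_{\succeq}(U \mid V) = \begin{cases} \bot & \text{if } U \cap V = \emptyset \\ \top & \text{if } U \cap V = V \\ d_{[U \cap V] \mid [V]} & \text{otherwise.} \end{cases}$$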


It remains to define $\oplus$ and $\otimes$ on $D$ so that Alg1 and Alg2 hold. This is easy to do, in large part
because $\oplus$ and $\otimes$ must be defined only on $\mathrm{Dom}(\oplus)$ and $\mathrm{Dom}(\otimes)$, where the definitions are
immediate because of the need to satisfy Alg1 and Alg2. It is easy to see that these conditions
and the definition of Pl guarantee that Pl1-4 hold. With a little more work, it can be shown
that these conditions imply Alg3 and Alg4 as well. (Here the fact that Alg3 and Alg4 are
restricted to $\mathrm{Dom}(\oplus)$ and $\mathrm{Dom}(\otimes)$ turns out to be critical; it is also important that $U \succeq^e V$
implies that $U \equiv U \cup V$.) I leave details to the interested reader.
This construction gives an algebraic cpm, as desired.


5.5.2 Proof of Theorems 5.3.1(c) and 5.3.2(b) 


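In this section, I prove that the complexity of determining if $\vec{X} = \vec{x}$ is a cause of $\varphi$ in
$(M, \vec{u})$ is DP-complete under the modified HP definition. Formally, we want to show that the
language

$$L = \{(M, \vec{u}, \varphi, \vec{X}, \vec{x}) : \vec{X} = \vec{x} \text{ satisfies AC1, AC2(a}^m\text{), and AC3 for } \varphi \text{ in } (M, \vec{u})\}$$

is DP-complete. Let

$$L_{AC2} = \{(M, \vec{u}, \varphi, \vec{X}, \vec{x}) : \vec{X} = \vec{x} \text{ satisfies AC1 and AC2(a}^m\text{) for } \varphi \text{ in } (M, \vec{u})\},$$

$$L_{AC3} = \{(M, \vec{u}, \varphi, \vec{X}, \vec{x}) : \vec{X} = \vec{x} \text{ satisfies AC1 and AC3 for } \varphi \text{ in } (M, \vec{u})\}.$$

Clearly $L = L_{AC2} \cap L_{AC3}$.

It is easy to see that $L_{AC2}$ is in NP and $L_{AC3}$ is in co-NP. Checking that AC1 holds can
be done in polynomial time. To check whether AC2(a$^m$) holds, we can guess $\vec{W}$ and $\vec{x}'$, and
check in polynomial time that $(M, \vec{u}) \models [\vec{X} \leftarrow \vec{x}', \vec{W} \leftarrow \vec{w}^*]\neg\varphi$ (where $\vec{w}^*$ is such that
$(M, \vec{u}) \models \vec{W} = \vec{w}^*$). Finally, checking whether AC3 is not satisfied can be done by guessing
a counterexample and verifying. Since $L = L_{AC2} \cap L_{AC3}$, it follows by definition that $L$ is
in DP.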

To see that $L$ is DP-complete, first observe that the language $\mathit{Sat} \times \mathit{Sat}^c$, that is, the
language consisting of pairs $(\psi, \psi')$ of formulas such that $\psi$ is satisfiable and $\psi'$ is not
satisfiable, is DP-complete. For if $\mathcal{L}(\mathit{Prop})$ is the set of formulas of propositional logic, then
$\mathit{Sat} \times \mathit{Sat}^c$ is the intersection of a language in NP ($\mathit{Sat} \times \mathcal{L}(\mathit{Prop})$) and a language in co-NP
($\mathcal{L}(\mathit{Prop}) \times \mathit{Sat}^c$), so it is in DP. To show that $\mathit{Sat} \times \mathit{Sat}^c$ is DP-hard, suppose that $L'$ is an
arbitrary language in DP. There must exist languages $L_1$ and $L_2$ in NP and co-NP,
respectively, such that $L' = L_1 \cap L_2$. Since Sat is NP-complete, there must exist a
polynomial-time computable function $f$ reducing $L_1$ to Sat; that is, $x \in L_1$ iff $f(x) \in \mathit{Sat}$. Similarly,
there must be a polynomial-time function $g$ reducing $L_2$ to $\mathit{Sat}^c$. The function $(f, g)$
mapping $x$ to $(f(x), g(x))$ gives a polynomial-time reduction from $L'$ to $\mathit{Sat} \times \mathit{Sat}^c$: $x \in L'$ iff
$(f(x), g(x)) \in \mathit{Sat} \times \mathit{Sat}^c$.

Thus, to show that $L$ is DP-complete, it suffices to reduce $\mathit{Sat} \times \mathit{Sat}^c$ to $L$. I now map a pair
$(\psi, \psi')$ of formulas to a tuple $(M, u, Z = 0, (X_0, X_1), (0, 0))$ such that $(\psi, \psi') \in \mathit{Sat} \times \mathit{Sat}^c$
iff $X_0 = 0 \land X_1 = 0$ is a cause of $Z = 0$ in $(M, u)$ according to the modified HP definition.

We can assume without loss of generality that $\psi$ and $\psi'$ involve disjoint sets of variables
(if not, rename all the variables in $\psi'$ so that they do not appear in $\psi$; this can clearly be
done without affecting whether $\psi'$ is satisfiable). Suppose that the variables that appear in $\psi$
and $\psi'$ are included in $Y_1, \ldots, Y_n$. Consider the causal model $M$ with endogenous variables
$X_0, X_1, Y_1, \ldots, Y_n, Z$, one exogenous variable $U$, and the following equations:

• $X_0 = U$,

• $X_1 = U$,

• $Y_i = X_0$ for $i = 1, \ldots, n$, and

• $Z = (X_0 = 1) \land \psi \land (X_1 = 1 \lor \psi')$.


I claim that (a) if $\psi$ is not satisfiable, then there is no cause of $Z = 0$ in $(M, 0)$ according
to the modified HP definition; (b) if both $\psi$ and $\psi'$ are satisfiable, then $X_0 = 0$ is a cause of
$Z = 0$ in $(M, 0)$; and (c) if $\psi$ is satisfiable and $\psi'$ is unsatisfiable, then $X_0 = 0 \land X_1 = 0$
is a cause of $Z = 0$ in $(M, 0)$. Clearly, (a) holds; if $\psi$ is unsatisfiable, then no setting of
the variables will make $Z = 1$. For (b), suppose that $\psi$ and $\psi'$ are both satisfiable. Since
$\psi$ and $\psi'$ involve disjoint sets of variables, there is a truth assignment $a$ to the variables
$Y_1, \ldots, Y_n$ that makes $\psi \land \psi'$ true. Let $\vec{W}$ be the subset of variables in $\{Y_1, \ldots, Y_n\}$ that
are set to 0 (i.e., false) in $a$. It follows that $(M, 0) \models [X_0 \leftarrow 1, \vec{W} \leftarrow \vec{0}](\psi \land \psi')$, so
$(M, 0) \models [X_0 \leftarrow 1, \vec{W} \leftarrow \vec{0}](Z = 1)$ and AC2(a$^m$) holds. AC3 is immediate. Thus,
$X_0 = 0$ is indeed a cause of $Z = 0$, as desired. Finally, for (c), if $\psi'$ is unsatisfiable, then
it is clear that neither $X_0 = 0$ nor $X_1 = 0$ individually is a cause of $Z = 0$ in $(M, 0)$; to
have $Z = 1$, we must set both $X_0 = 1$ and $X_1 = 1$. Moreover, the same argument as in
part (b) shows that, since $\psi$ is satisfiable, we can find a subset $\vec{W}$ of $\{Y_1, \ldots, Y_n\}$ such that
$(M, 0) \models [X_0 \leftarrow 1, \vec{W} \leftarrow \vec{0}]\psi$, so $(M, 0) \models [X_0 \leftarrow 1, X_1 \leftarrow 1, \vec{W} \leftarrow \vec{0}](Z = 1)$. This
shows that $X_0 = 0 \land X_1 = 0$ is a cause of $Z = 0$ iff $(\psi, \psi') \in \mathit{Sat} \times \mathit{Sat}^c$. This completes
the proof of Theorem 5.3.1(c).
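The reduction can be tried out on small instances. The following Python sketch is a brute-force illustration, not an efficient procedure: it assumes that the two propositional formulas are supplied as Python predicates over the values of hypothetical variables `Y1`, ..., `Yn`, builds the model's equations as read from the construction above, and checks AC2(a$^m$) and AC3 by exhaustive search in the context $U = 0$. All function names are made up for this illustration.

```python
from itertools import combinations, product

def make_model(psi, psi_prime, n):
    """Equations X0 = U, X1 = U, Yi = X0, Z = (X0=1) & psi & (X1=1 | psi')."""
    ys = [f"Y{i}" for i in range(1, n + 1)]
    def solve(do, u=0):
        v = {"X0": do.get("X0", u), "X1": do.get("X1", u)}
        for y in ys:
            v[y] = do.get(y, v["X0"])
        z = v["X0"] == 1 and psi(v) and (v["X1"] == 1 or psi_prime(v))
        v["Z"] = do.get("Z", int(z))
        return v
    return ys, solve

def powerset(items):
    return [c for r in range(len(items) + 1) for c in combinations(items, r)]

def ac2(cause, ys, solve):
    """AC2(a^m): some setting of the cause, with some W held at actual values, gives Z = 1."""
    actual = solve({})
    others = [v for v in ["X0", "X1", *ys] if v not in cause]
    for w in powerset(others):
        held = {v: actual[v] for v in w}
        for setting in product((0, 1), repeat=len(cause)):
            if solve({**held, **dict(zip(cause, setting))})["Z"] == 1:
                return True
    return False

def is_cause(cause, ys, solve):
    """Is the conjunction 'each variable in cause = 0' a cause of Z = 0 in context 0?"""
    assert solve({})["Z"] == 0  # AC1 holds in context 0
    proper = [s for s in powerset(cause) if 0 < len(s) < len(cause)]
    return ac2(cause, ys, solve) and not any(ac2(s, ys, solve) for s in proper)

# Case (c): psi = Y1 is satisfiable, psi' is unsatisfiable.
ys, solve = make_model(lambda v: v["Y1"] == 1, lambda v: False, 2)
print(is_cause(("X0", "X1"), ys, solve), is_cause(("X0",), ys, solve))
# Case (b): psi = Y1 and psi' = not Y2 are both satisfiable.
ys, solve = make_model(lambda v: v["Y1"] == 1, lambda v: v["Y2"] == 0, 2)
print(is_cause(("X0", "X1"), ys, solve), is_cause(("X0",), ys, solve))
```

The search over subsets is exponential in the number of variables, as one would expect from a naive check of a DP-complete property.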

The proof of Theorem 5.3.2(b) is almost immediate. Now we want to show that the
language

$$L' = \{(M, \vec{u}, \varphi, X, x) : X = x \text{ satisfies AC1, AC2(a}^m\text{), and AC3 for } \varphi \text{ in } (M, \vec{u})\}$$

is NP-complete. AC3 trivially holds and, as we have observed, checking that AC1 and
AC2(a$^m$) hold is in NP. Moreover, the argument above shows that AC2(a$^m$) is NP-hard
even if we consider only singleton causes.


5.5.3 Proof of Theorems 5.4.1, 5.4.3, and 5.4.4


In this section, I prove Theorems 5.4.1, 5.4.3, and 5.4.4. I repeat the statements for the reader’s
convenience.


Theorem 5.4.1: $\mathrm{AX}_{\mathrm{rec}}(\mathcal{S})$ is a sound and complete axiomatization for the language $\mathcal{L}(\mathcal{S})$ in
$\mathcal{M}_{\mathrm{rec}}(\mathcal{S})$.


Proof: The fact that C1, C2, C4, and C6 are valid is almost immediate; the soundness of C3 
was proved in Lemma 2.10.2; and the soundness of C5 is discussed in Section 5.4 before the 
statement of the theorem. 

For completeness, it suffices to prove that if a formula $\varphi$ in $\mathcal{L}(\mathcal{S})$ is consistent with
$\mathrm{AX}_{\mathrm{rec}}(\mathcal{S})$ (i.e., its negation is not provable), then $\varphi$ is satisfiable in $\mathcal{M}_{\mathrm{rec}}(\mathcal{S})$. For suppose
that every consistent formula is satisfiable and that $\varphi$ is valid. If $\varphi$ is not provable, then $\neg\varphi$
is consistent. By assumption, this means that $\neg\varphi$ is satisfiable, contradicting the assumption
that $\varphi$ is valid.

So suppose that a formula $\varphi \in \mathcal{L}(\mathcal{S})$, with $\mathcal{S} = (\mathcal{U}, \mathcal{V}, \mathcal{R})$, is consistent with $\mathrm{AX}_{\mathrm{rec}}(\mathcal{S})$.
Consider a maximal consistent set $C$ of formulas that includes $\varphi$. (A maximal consistent set
is a set of formulas whose conjunction is consistent such that any larger set of formulas would
be inconsistent.) It follows easily from standard propositional reasoning (i.e., using C0 and
MP only) that such a maximal consistent set exists. Moreover, it is well known that a maximal
$\mathrm{AX}_{\mathrm{rec}}(\mathcal{S})$-consistent set $C$ has the following properties:


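• for each formula $\psi \in \mathcal{L}(\mathcal{S})$, either $\psi \in C$ or $\neg\psi \in C$;

• $\psi \land \psi' \in C$ iff $\psi \in C$ and $\psi' \in C$;

• $\psi \lor \psi' \in C$ iff $\psi \in C$ or $\psi' \in C$;

• each instance of an axiom in $\mathrm{AX}_{\mathrm{rec}}(\mathcal{S})$ is in $C$.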


(See the notes for references for this standard result.) Moreover, from C1 and C2, it follows
that for each variable $X \in \mathcal{V}$ and vector $\vec{y}$ of values for $\vec{Y}$, there exists exactly one element
$x \in \mathcal{R}(X)$ such that $[\vec{Y} \leftarrow \vec{y}](X = x) \in C$ (i.e., in statisticians’ notation, $X_{\vec{y}} = x \in C$). I now
construct a causal model $M = (\mathcal{S}, \mathcal{F}) \in \mathcal{M}_{\mathrm{rec}}(\mathcal{S})$ and context $\vec{u}$ such that $(M, \vec{u}) \models \psi$ for
every formula $\psi \in C$ (and, in particular, $\varphi$).

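The idea is straightforward: the formulas in $C$ determine $\mathcal{F}$. For each variable $X \in \mathcal{V}$, let
$\vec{Y}_X = \mathcal{V} - \{X\}$. By C1 and C2, for all $\vec{y} \in \mathcal{R}(\vec{Y}_X)$, there is a unique $x \in \mathcal{R}(X)$ such that
$[\vec{Y}_X \leftarrow \vec{y}](X = x) \in C$. For all contexts $\vec{u} \in \mathcal{R}(\mathcal{U})$, define $F_X(\vec{y}, \vec{u}) = x$. (Note that $F_X$ is
independent of the context $\vec{u}$.) This defines $F_X$ for all endogenous variables $X$, and hence $\mathcal{F}$.
Let $M = (\mathcal{S}, \mathcal{F})$.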


I claim that for all formulas $\psi \in \mathcal{L}(\mathcal{S})$, we have $\psi \in C$ iff $(M, \vec{u}) \models \psi$ for all contexts
$\vec{u}$. Before proving this, I need to deal with a subtle issue. The definition of $\models$ was given in
Section 2.2.1 under the assumption that $M$ is recursive (i.e., $M \in \mathcal{M}_{\mathrm{rec}}$). Although I will
show this, I have not done so yet. Thus, for now, I take $(M, \vec{u}) \models [\vec{Y} \leftarrow \vec{y}]\psi$ to hold if there
is a unique solution to the equations in the model $M_{\vec{Y} \leftarrow \vec{y}}$ in context $\vec{u}$, and $\psi$ holds in that
solution. This definition is consistent with the definition given in Section 2.2.1 and makes
sense even if $M \notin \mathcal{M}_{\mathrm{rec}}(\mathcal{S})$. (Recall that, in Section 2.7, I gave a definition of $\models$ that works
even in nonrecursive models. I could have worked with that definition here; it is slightly easier
to prove the result using the approach that I have taken.)

I first prove that $\psi \in C$ iff $(M, \vec{u}) \models \psi$ for formulas $\psi$ of the form $[\vec{Y} \leftarrow \vec{y}](X = x)$.
The proof is by induction on $|\mathcal{V}| - |\vec{Y}|$. First consider the case where $|\mathcal{V}| - |\vec{Y}| = 0$.
In that case, $X \in \vec{Y}$. If $[\vec{Y} \leftarrow \vec{y}](X = x) \in C$, then the value of each endogenous
variable is determined by its setting in $\vec{y}$, so there is clearly a unique solution to
the equations in $M_{\vec{Y} \leftarrow \vec{y}}$ for every context $\vec{u}$ and $X = x$ in that solution. Conversely, if
$(M, \vec{u}) \models [\vec{Y} \leftarrow \vec{y}](X = x)$, then $[\vec{Y} \leftarrow \vec{y}](X = x)$ is an instance of C4, so must be in $C$.


If $|\mathcal{V}| - |\vec{Y}| = 1$ and $[\vec{Y} \leftarrow \vec{y}](X = x) \in C$, then again there is a unique solution to
the equations in $M_{\vec{Y} \leftarrow \vec{y}}$ for each context; the value of each variable $Y \in \vec{Y}$ is determined by
$\vec{y}$, and the value of the one endogenous variable $W$ not in $\vec{Y}$ is determined by the equation
$F_W$. If $[\vec{Y} \leftarrow \vec{y}](X = x) \in C$ and $X \in \vec{Y}$, then the fact that $X = x$ in the unique solution
follows from C4, while if $X \notin \vec{Y}$, the result follows from the definition of $F_X$. Conversely, if
$(M, \vec{u}) \models [\vec{Y} \leftarrow \vec{y}](X = x)$ and $X \in \vec{Y}$ then again $[\vec{Y} \leftarrow \vec{y}](X = x)$ must be an instance of
C4, so must be in $C$, while if $X \notin \vec{Y}$, then we must have $[\vec{Y} \leftarrow \vec{y}](X = x) \in C$, given how
$F_X$ is defined.


For the general case, suppose that $|\mathcal{V}| - |\vec{Y}| = k > 1$ and $[\vec{Y} \leftarrow \vec{y}](X = x) \in C$. I want
to show that there is a unique solution to the equations in $M_{\vec{Y} \leftarrow \vec{y}}$ in every context $\vec{u}$ and that
in this solution, $X$ has value $x$. To show that there is a solution, I define a vector $\vec{v}$ and show
that it is in fact a solution. If $W \in \vec{Y}$ and $W \leftarrow w$ is a conjunct of $\vec{Y} \leftarrow \vec{y}$, then set the $W$
component of $\vec{v}$ to $w$. If $W$ is not in $\vec{Y}$, then set the $W$ component of $\vec{v}$ to the unique value
$w$ such that $[\vec{Y} \leftarrow \vec{y}](W = w) \in C$. (By C1 and C2 there is such a unique value $w$.) I claim
that $\vec{v}$ is a solution to the equations in $M_{\vec{Y} \leftarrow \vec{y}}$ for all contexts $\vec{u}$.


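To see this, let $V_1$ be a variable in $\mathcal{V} - \vec{Y}$, let $V_2$ be an arbitrary variable in $\mathcal{V}$, and let
$v_1$ and $v_2$ be the values of these variables in $\vec{v}$. By assumption, $[\vec{Y} \leftarrow \vec{y}](V_1 = v_1) \in C$,
$[\vec{Y} \leftarrow \vec{y}](V_2 = v_2) \in C$, and $C$ contains every instance of C3, so it follows that
$[\vec{Y} \leftarrow \vec{y}, V_1 \leftarrow v_1](V_2 = v_2) \in C$. Since $V_2$ was arbitrary, it follows from the induction
hypothesis that $\vec{v}$ is the unique solution to the equations in $M_{\vec{Y} \leftarrow \vec{y}, V_1 \leftarrow v_1}$ for all contexts $\vec{u}$. For
every endogenous variable $Z$ other than $V_1$, the equation $F_Z$ for $Z$ is the same in $M_{\vec{Y} \leftarrow \vec{y}}$ and
in $M_{\vec{Y} \leftarrow \vec{y}, V_1 \leftarrow v_1}$. Thus, every equation except possibly that for $V_1$ is satisfied by $\vec{v}$ in $M_{\vec{Y} \leftarrow \vec{y}}$
for all contexts $\vec{u}$. Since $|\mathcal{V} - \vec{Y}| \ge 2$, we can repeat this argument starting with a variable in
$\mathcal{V} - \vec{Y}$ other than $V_1$ to conclude that, in fact, every equation in $M_{\vec{Y} \leftarrow \vec{y}}$ is satisfied by $\vec{v}$ for all
contexts $\vec{u}$. That is, $\vec{v}$ is a solution to the equations in $M_{\vec{Y} \leftarrow \vec{y}}$ for all contexts $\vec{u}$.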
It remains to show that $\vec{v}$ is the unique solution to the equations in $M_{\vec{Y} \leftarrow \vec{y}}$ for all contexts
$\vec{u}$. Suppose there were another solution, say $\vec{v}'$, to the equations in $M_{\vec{Y} \leftarrow \vec{y}}$ for some context
$\vec{u}$. There must be some variable $V_1$ whose value in $\vec{v}$ is different from its value in $\vec{v}'$. Suppose
that the value of $V_1$ in $\vec{v}$ is $v_1$ and its value in $\vec{v}'$ is $v_1'$, with $v_1 \ne v_1'$. By construction,
$[\vec{Y} \leftarrow \vec{y}](V_1 = v_1) \in C$. By C1, $[\vec{Y} \leftarrow \vec{y}](V_1 \ne v_1') \in C$. Since $\vec{v}'$ is a solution to the
equations in $M_{\vec{Y} \leftarrow \vec{y}}$ in context $\vec{u}$, it is easy to check that $\vec{v}'$ is also a solution to the equations
in $M_{\vec{Y} \leftarrow \vec{y}, V_1 \leftarrow v_1'}$ in context $\vec{u}$. By the induction hypothesis, $\vec{v}'$ is the unique solution to the
equations in $M_{\vec{Y} \leftarrow \vec{y}, V_1 \leftarrow v_1'}$ in every context. Let $V_2$ be a variable other than $V_1$ in $\mathcal{V} - \vec{Y}$, and let
$v_2'$ be the value of $V_2$ in $\vec{v}'$. As above, $\vec{v}'$ is the unique solution to the equations in $M_{\vec{Y} \leftarrow \vec{y}, V_2 \leftarrow v_2'}$
in all contexts. It follows from the induction hypothesis that $[\vec{Y} \leftarrow \vec{y}, V_1 \leftarrow v_1'](V_2 = v_2')$ and
$[\vec{Y} \leftarrow \vec{y}, V_2 \leftarrow v_2'](V_1 = v_1')$ are both in $C$. I now claim that $[\vec{Y} \leftarrow \vec{y}](V_1 = v_1') \in C$. This,
together with the fact that $[\vec{Y} \leftarrow \vec{y}](V_1 \ne v_1') \in C$ (and hence $\neg[\vec{Y} \leftarrow \vec{y}](V_1 = v_1') \in C$ by
C6(a)), contradicts the consistency of $C$.

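To prove that $[\vec{Y} \leftarrow \vec{y}](V_1 = v_1') \in C$, note that by C5, at most one of $V_1 \rightsquigarrow V_2$ and
$V_2 \rightsquigarrow V_1$ is in $C$. If $V_2 \rightsquigarrow V_1 \notin C$, then $\neg(V_2 \rightsquigarrow V_1) \in C$, so $V_2$ does not affect $V_1$. Since
$[\vec{Y} \leftarrow \vec{y}, V_2 \leftarrow v_2'](V_1 = v_1') \in C$, it follows that $[\vec{Y} \leftarrow \vec{y}](V_1 = v_1') \in C$, proving the claim.

Now suppose that $V_1 \rightsquigarrow V_2 \notin C$. Then an argument analogous to that above shows that
$[\vec{Y} \leftarrow \vec{y}](V_2 = v_2') \in C$. If $[\vec{Y} \leftarrow \vec{y}](V_1 = v_1') \notin C$, by C2, $[\vec{Y} \leftarrow \vec{y}](V_1 = v_1'') \in C$ for
some $v_1'' \ne v_1'$. Now applying C3, it follows that $[\vec{Y} \leftarrow \vec{y}, V_2 \leftarrow v_2'](V_1 = v_1'') \in C$. But
this, together with the fact that $[\vec{Y} \leftarrow \vec{y}, V_2 \leftarrow v_2'](V_1 = v_1') \in C$ and C1, contradicts the
consistency of $C$. This completes the uniqueness proof.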

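For the converse, suppose that $(M, \vec{u}) \models [\vec{Y} \leftarrow \vec{y}](X = x)$. Suppose, by way of
contradiction, that $[\vec{Y} \leftarrow \vec{y}](X = x) \notin C$. Then, by C2, $[\vec{Y} \leftarrow \vec{y}](X = x') \in C$ for some $x' \ne x$.
By the argument above, $(M, \vec{u}) \models [\vec{Y} \leftarrow \vec{y}](X = x')$, a contradiction.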

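The rest of this proof is straightforward. I next show that for all formulas $\psi$ of form
$[\vec{Y} \leftarrow \vec{y}]\psi'$, $\psi \in C$ iff $(M, \vec{u}) \models \psi$ for all contexts $\vec{u}$. The proof is by induction on the
structure of $\psi'$. Since $\psi'$ is a Boolean combination of formulas of the form $X = x$, for the
base case, $\psi'$ has the form $X = x$. This was handled above. If $\psi'$ has the form $\neg\psi''$, then

$$\begin{array}{lll}
 & \psi \in C & \\
\text{iff} & \neg[\vec{Y} \leftarrow \vec{y}]\psi'' \in C & \text{[by C6(a)]} \\
\text{iff} & [\vec{Y} \leftarrow \vec{y}]\psi'' \notin C & \\
\text{iff} & (M, \vec{u}) \not\models [\vec{Y} \leftarrow \vec{y}]\psi'' \text{ for all contexts } \vec{u} & \text{[by the induction hypothesis]} \\
\text{iff} & (M, \vec{u}) \models \neg[\vec{Y} \leftarrow \vec{y}]\psi'' \text{ for all contexts } \vec{u} & \\
\text{iff} & (M, \vec{u}) \models [\vec{Y} \leftarrow \vec{y}]\neg\psi'' \text{ for all contexts } \vec{u}. &
\end{array}$$

If $\psi'$ has the form $\psi_1 \land \psi_2$, then

$$\begin{array}{lll}
 & \psi \in C & \\
\text{iff} & [\vec{Y} \leftarrow \vec{y}]\psi_1 \land [\vec{Y} \leftarrow \vec{y}]\psi_2 \in C & \text{[by C6(b)]} \\
\text{iff} & [\vec{Y} \leftarrow \vec{y}]\psi_1 \in C \text{ and } [\vec{Y} \leftarrow \vec{y}]\psi_2 \in C & \\
\text{iff} & (M, \vec{u}) \models [\vec{Y} \leftarrow \vec{y}]\psi_1 \land [\vec{Y} \leftarrow \vec{y}]\psi_2 \text{ for all contexts } \vec{u} & \text{[by the induction hypothesis]} \\
\text{iff} & (M, \vec{u}) \models [\vec{Y} \leftarrow \vec{y}](\psi_1 \land \psi_2) \text{ for all contexts } \vec{u}. &
\end{array}$$

The argument if $\psi'$ has the form $\psi_1 \lor \psi_2$ is similar to that for $\psi_1 \land \psi_2$, using axiom C6(c),
and is left to the reader.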

Finally, we need to consider the case that $\psi$ is a Boolean combination of formulas of the
form $[\vec{Y} \leftarrow \vec{y}]\psi'$. We again proceed by induction on structure. The argument is similar to that
above and is left to the reader.

There is one more thing that needs to be checked: that $M \in \mathcal{M}_{\mathrm{rec}}(\mathcal{S})$. Since every instance
of C5 is in $C$, every instance of C5 is true in $(M, \vec{u})$ for all contexts $\vec{u}$. We just need to observe
that this guarantees that $M$ is recursive. Define a relation $\preceq'$ on the endogenous variables by
taking $X \preceq' Y$ if $(M, \vec{u}) \models X \rightsquigarrow Y$. It is easy to see that $\preceq'$ is reflexive; we must have
$(M, \vec{u}) \models X \rightsquigarrow X$. Let $\preceq$ be the transitive closure of $\preceq'$: that is, $\preceq$ is the smallest transitive
relation (“smallest” in the sense of relating the fewest pairs of elements) that includes $\preceq'$. It
is well known (and easy to check) that $X \preceq Y$ iff there exists $X_0, \ldots, X_n$ such that $X = X_0$,
$Y = X_n$, $X_0 \preceq' X_1$, $X_1 \preceq' X_2$, ..., $X_{n-1} \preceq' X_n$. (The relation $\preceq$ defined this way
extends $\preceq'$ and is easily seen to be transitive; moreover, any transitive relation that extends $\preceq'$
must also extend $\preceq$.)

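By construction, $\preceq$ is reflexive and transitive. Axiom C5 guarantees that it is
antisymmetric. For suppose, by way of contradiction, that $X \preceq Y$ and $Y \preceq X$ for
distinct variables $X$ and $Y$. Then there exist variables $X_0, \ldots, X_n$ and $Y_0, \ldots, Y_m$ such that
$X = X_0 = Y_m$, $Y = X_n = Y_0$, $X_0 \preceq' X_1$, ..., $X_{n-1} \preceq' X_n$, $Y_0 \preceq' Y_1$, ..., $Y_{m-1} \preceq' Y_m$.
But then

$$(M, \vec{u}) \models X_0 \rightsquigarrow X_1 \land \ldots \land X_{n-1} \rightsquigarrow X_n \land X_n \rightsquigarrow Y_1 \land \ldots \land Y_{m-1} \rightsquigarrow Y_m.$$

Since $Y_m = X_0$, this contradicts C5.

Finally, the definition of $\rightsquigarrow$ guarantees that $X$ affects $Y$ iff $(M, \vec{u}) \models X \rightsquigarrow Y$ in some
context $\vec{u}$. Thus, $M \in \mathcal{M}_{\mathrm{rec}}(\mathcal{S})$, as desired (and, in fact, the same “affects” ordering $\preceq$ on
endogenous variables can be used in all contexts). ∎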
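The last step of this argument, taking the reflexive transitive closure of the "affects" relation and checking that it is antisymmetric, is easy to mechanize. In the following Python sketch the relation is given as a hypothetical set of pairs; the variable names and both function names are assumptions made for this illustration.

```python
def transitive_closure(variables, affects):
    """Map each variable x to the set of y reachable from x (reflexively)."""
    reach = {x: {x} | {y for (a, y) in affects if a == x} for x in variables}
    changed = True
    while changed:
        changed = False
        for x in variables:
            # Everything reachable from something reachable from x.
            new = set().union(*(reach[y] for y in reach[x]))
            if not new <= reach[x]:
                reach[x] |= new
                changed = True
    return reach

def is_antisymmetric(reach):
    """No two distinct variables are each reachable from the other."""
    return all(not (y in reach[x] and x in reach[y])
               for x in reach for y in reach if x != y)

variables = ["A", "B", "C"]
acyclic = {("A", "B"), ("B", "C")}
cyclic = acyclic | {("C", "A")}
print(is_antisymmetric(transitive_closure(variables, acyclic)),
      is_antisymmetric(transitive_closure(variables, cyclic)))
```

An antisymmetric closure corresponds to a partial order on the endogenous variables, which is what recursiveness requires.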


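Theorem 5.4.3: A formula $\varphi \in \mathcal{L}(\mathcal{S})$ is satisfiable in $\mathcal{M}_{\mathrm{rec}}(\mathcal{S})$ iff there exists $\vec{u}$ such that $\varphi$
is satisfiable in $\mathcal{M}_{\mathrm{rec}}(\mathcal{S}_{\varphi,\vec{u}})$.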


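Proof: Clearly, if a formula is satisfiable in $\mathcal{M}_{\mathrm{rec}}(\mathcal{S}_{\varphi,\vec{u}})$, then it is satisfiable in $\mathcal{M}_{\mathrm{rec}}(\mathcal{S})$.
We can easily convert a causal model $M = (\mathcal{S}_{\varphi,\vec{u}}, \mathcal{F}) \in \mathcal{M}_{\mathrm{rec}}(\mathcal{S}_{\varphi,\vec{u}})$ satisfying $\varphi$ to a causal
model $M' = (\mathcal{S}, \mathcal{F}') \in \mathcal{M}_{\mathrm{rec}}(\mathcal{S})$ satisfying $\varphi$ by simply defining $F'_X$ to be a constant,
independent of its arguments, for $X \in \mathcal{V} - \mathcal{V}_\varphi$; if $X \in \mathcal{V}_\varphi$, define $F'_X(\vec{u}', \vec{x}, \vec{y}) = F_X(\vec{u}, \vec{x})$,
where $\vec{x} \in \times_{Y \in \mathcal{V}_\varphi - \{X\}} \mathcal{R}(Y)$ and $\vec{y} \in \times_{Y \in (\mathcal{V} - \mathcal{V}_\varphi)} \mathcal{R}(Y)$.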

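For the converse, suppose that $\varphi$ is satisfiable in a causal model $M = (\mathcal{S}, \mathcal{F}) \in \mathcal{M}_{\mathrm{rec}}(\mathcal{S})$.
Thus, $(M, \vec{u}) \models \varphi$ for some context $\vec{u}$, and there is a partial ordering $\preceq_{\vec{u}}$ on the variables in
$\mathcal{V}$ such that unless $Y \preceq_{\vec{u}} X$, $F_X$ is independent of the value of $Y$ in context $\vec{u}$.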

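This means that we can view $F_X$ as a function of $\vec{u}$ and the variables $Y \in \mathcal{V}$ such that
$Y \prec_{\vec{u}} X$. Let $\mathrm{Pre}(X) = \{Y \in \mathcal{V} : Y \prec_{\vec{u}} X\}$. For convenience, I allow $F_X$ to take as
arguments only $\vec{u}$ and the values of the variables in $\mathrm{Pre}(X)$, rather than requiring its arguments
to include the values of all the variables in $\mathcal{U} \cup \mathcal{V} - \{X\}$. Now define functions
$F'_X : \{\vec{u}\} \times (\times_{Y \in \mathcal{V}_\varphi - \{X\}} \mathcal{R}(Y)) \rightarrow \mathcal{R}(X)$ for all $X \in \mathcal{V}$ by induction on $\preceq_{\vec{u}}$; that is, start by defining
$F'_X$ for the $\preceq_{\vec{u}}$-minimal elements $X$, whose value is independent of that of all the other
variables, and then work up the $\preceq_{\vec{u}}$ chains. Suppose $X \in \mathcal{V}_\varphi$ and $\vec{y}$ is a vector of values for
the variables in $\mathcal{V}_\varphi - \{X\}$. If $X$ is $\preceq_{\vec{u}}$-minimal, then define $F'_X(\vec{u}, \vec{y}) = F_X(\vec{u})$. In general,
define $F'_X(\vec{u}, \vec{y}) = F_X(\vec{u}, \vec{z})$, where $\vec{z}$ is a vector of values for the variables in $\mathrm{Pre}(X)$ defined
as follows. If $Y \in \mathrm{Pre}(X) \cap \mathcal{V}_\varphi$, then the value of the $Y$ component in $\vec{z}$ is the value of the
$Y$ component in $\vec{y}$; if $Y \in \mathrm{Pre}(X) - \mathcal{V}_\varphi$, then the value of the $Y$ component in $\vec{z}$ is $F'_Y(\vec{u}, \vec{y})$.
(By the induction hypothesis, $F'_Y(\vec{u}, \vec{y})$ has already been defined.) Let $M' = (\mathcal{S}_{\varphi,\vec{u}}, \mathcal{F}')$. It is
easy to check that $M' \in \mathcal{M}_{\mathrm{rec}}(\mathcal{S}_{\varphi,\vec{u}})$ (the ordering of the variables is just $\preceq_{\vec{u}}$ restricted to $\mathcal{V}_\varphi$).
Moreover, the construction guarantees that if $\vec{X} \subseteq \mathcal{V}_\varphi$, then in context $\vec{u}$, the solutions to the
equations $M_{\vec{X} \leftarrow \vec{x}}$ and $M'_{\vec{X} \leftarrow \vec{x}}$ are the same when restricted to the variables in $\mathcal{V}_\varphi$. It follows
that $(M', \vec{u}) \models \varphi$. ∎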


Theorem 5.4.4: Given as input a pair $(\varphi, \mathcal{S})$, where $\varphi \in \mathcal{L}(\mathcal{S})$ and $\mathcal{S}$ is a finite signature,
the problem of deciding if $\varphi$ is satisfiable in $\mathcal{M}_{\mathrm{rec}}(\mathcal{S})$ is NP-complete.


Proof: The NP lower bound is easy; there is an obvious way to encode the satisfiability
problem for propositional logic into the satisfiability problem for $\mathcal{L}(\mathcal{S})$. Given a
propositional formula $\varphi$ with primitive propositions $p_1, \ldots, p_k$, let $\mathcal{S} = (\emptyset, \{X_1, \ldots, X_k\}, \mathcal{R})$, where
$\mathcal{R}(X_i) = \{0, 1\}$ for $i = 1, \ldots, k$. Replace each occurrence of the primitive proposition $p_i$ in
$\varphi$ with the formula $X_i = 1$. This gives us a formula $\varphi'$ in $\mathcal{L}(\mathcal{S})$. It is easy to see that $\varphi'$ is
satisfiable in a causal model $M \in \mathcal{M}_{\mathrm{rec}}(\mathcal{S})$ iff $\varphi$ is a satisfiable propositional formula.
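The encoding used for the lower bound is a simple syntactic substitution. The Python sketch below assumes a hypothetical representation of formulas as nested tuples (`("var", name)`, `("not", f)`, `("and", f, g)`, `("or", f, g)`); since the translated formula contains no interventions, it can be checked against models whose equations are constants, which amounts to trying every assignment of values to the variables.

```python
from itertools import product

def translate(f):
    """Replace each primitive proposition p by the causal formula X_p = 1."""
    if f[0] == "var":
        return ("eq", "X_" + f[1], 1)
    return (f[0], *map(translate, f[1:]))

def holds(f, solution):
    """Evaluate a translated formula in a solution (a dict of variable values)."""
    op = f[0]
    if op == "eq":
        return solution[f[1]] == f[2]
    if op == "not":
        return not holds(f[1], solution)
    if op == "and":
        return holds(f[1], solution) and holds(f[2], solution)
    if op == "or":
        return holds(f[1], solution) or holds(f[2], solution)
    raise ValueError(f"unknown operator: {op}")

def variables(f):
    if f[0] == "eq":
        return {f[1]}
    return set().union(*(variables(g) for g in f[1:]))

def satisfiable(f):
    """Try every constant model, i.e., every assignment to the variables."""
    vs = sorted(variables(f))
    return any(holds(f, dict(zip(vs, vals)))
               for vals in product((0, 1), repeat=len(vs)))

p, q = ("var", "p1"), ("var", "p2")
print(satisfiable(translate(("and", ("or", p, q), ("not", p)))),
      satisfiable(translate(("and", p, ("not", p)))))
```

The translation itself runs in time linear in the size of the formula; only the brute-force check at the end is exponential.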

For the NP upper bound, suppose that we are given $(\varphi, \mathcal{S})$ with $\varphi \in \mathcal{L}(\mathcal{S})$. We want to
check whether $\varphi$ is satisfiable in $\mathcal{M}_{\mathrm{rec}}(\mathcal{S})$. The basic idea is to guess a causal model $M$
and verify that it indeed satisfies $\varphi$. There is a problem with this though. To completely
describe a model $M$, we need to describe the functions $F_X$. However, there may be many
variables $X$ in $\mathcal{S}$, and they can have many possible inputs. As we have seen, just describing
these functions may take time much longer than polynomial in $\varphi$. Part of the solution to this
problem is provided by Theorem 5.4.3, which tells us that it suffices to check whether $\varphi$ is
satisfiable in $\mathcal{M}_{\mathrm{rec}}(\mathcal{S}_\varphi)$. In light of this, for the remainder of this part of the proof, I assume
without loss of generality that $\mathcal{S} = \mathcal{S}_\varphi$. Moreover, the proof of Theorem 5.4.1 shows that if a
formula is satisfiable at all, it is satisfiable in a model where the equations are independent of
the exogenous variables.

This limits the number of variables that we must consider to at most $|\varphi|$ (the length of the
formula $\varphi$, viewed as a string of symbols). But even this does not solve the problem completely.
Since we are not given any bounds on $|\mathcal{R}(Y)|$ for variables $Y$ in $\mathcal{S}_\varphi$, even describing the
functions $F_Y$ for the variables $Y$ that appear in $\varphi$ on all their possible input vectors could take
time much more than polynomial in $\varphi$. The solution is to give only a short partial description
of a model $M$ and show that this suffices.

Let $R$ be the set of all assignments $\vec{Y} \leftarrow \vec{y}$ that appear in $\varphi$. Say that two causal models $M$
and $M'$ in $\mathcal{M}_{\mathrm{rec}}(\mathcal{S}_\varphi)$ agree on $R$ if, for each assignment $\vec{Y} \leftarrow \vec{y} \in R$, the (unique) solutions
to the equations in $M_{\vec{Y} \leftarrow \vec{y}}$ and $M'_{\vec{Y} \leftarrow \vec{y}}$ in context $\vec{u}$ are the same. It is easy to see that if $M$ and
$M'$ agree on $R$, then either both $M$ and $M'$ satisfy $\varphi$ in context $\vec{u}$ or neither does. That is, all
we need to know about a causal model is how it deals with the relevant assignments, namely those
in $R$.

For each assignment $\vec{Y} \leftarrow \vec{y} \in R$, guess a vector $\vec{v}(\vec{Y} \leftarrow \vec{y})$ of values for the endogenous
variables; intuitively, this is the unique solution to the equations in $M_{\vec{Y} \leftarrow \vec{y}}$ in context $\vec{u}$. Given
this guess, it is easy to check whether $\varphi$ is satisfied in a model where these guesses are the
solutions to the equations. It remains to show that there exists a causal model in $\mathcal{M}_{\mathrm{rec}}(\mathcal{S})$
where the relevant equations have these solutions.

To do this, we first guess an ordering $\prec$ on the variables. We can then verify whether the
solution vectors $\vec{v}(\vec{Y} \leftarrow \vec{y})$ guessed for the relevant equations are compatible with $\prec$, in the
sense that it is not the case that there are two solutions $\vec{v}$ and $\vec{v}'$ such that some variable $X$
takes on different values in $\vec{v}$ and $\vec{v}'$, but all variables $Y$ such that $Y \prec X$ take on the same
values in $\vec{v}$ and $\vec{v}'$. It is easy to see that if the solutions are compatible with $\prec$, then we can
define the functions $F_X$ for $X \in \mathcal{V}$ such that all the equations hold and $F_X$ is independent of
the values of $Y$ if $X \prec Y$ for all $X, Y \in \mathcal{V}$. (Note that we never actually have to write out the
functions $F_X$, which may take too long; we just have to know that they exist.) To summarize,
as long as we can guess some solutions to the relevant equations such that a causal model
that has these solutions satisfies $\varphi$, and an ordering $\prec$ such that these solutions are compatible
with $\prec$, then $\varphi$ is satisfiable in $\mathcal{M}_{\mathrm{rec}}(\mathcal{S}_\varphi)$. Conversely, if $\varphi$ is satisfiable in $M \in \mathcal{M}_{\mathrm{rec}}(\mathcal{S}_\varphi)$,
then there clearly are solutions to the relevant equations that satisfy $\varphi$ and an ordering $\prec$ such
that these solutions are compatible with $\prec$. (We just take the solutions and the ordering $\prec$
from $M$.) This shows that the satisfiability problem for $\mathcal{M}_{\mathrm{rec}}(\mathcal{S}_\varphi)$ is in NP, as desired. ∎
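The verification step of this guess-and-check procedure can be sketched in Python. The representation below is hypothetical: each relevant assignment is a frozenset of (variable, value) pairs, each guessed solution is a dict, and the ordering is a list of variables. One further assumption is made explicit in the code: the compatibility test is applied only to variables that neither assignment intervenes on, since an intervened variable's value is not produced by its equation.

```python
from itertools import combinations

def compatible(guesses, order):
    """Check guessed solutions against an ordering of the variables.

    guesses: dict mapping an assignment (frozenset of (var, value) pairs)
             to a guessed solution (dict from variable to value).
    order:   list of variables; earlier variables may affect later ones.
    """
    for do, sol in guesses.items():
        if any(sol[var] != val for var, val in do):
            return False  # a solution must respect its own intervention
    for (do1, s1), (do2, s2) in combinations(guesses.items(), 2):
        # Assumption: skip variables set by either intervention.
        fixed = {var for var, _ in do1} | {var for var, _ in do2}
        for i, x in enumerate(order):
            if x in fixed or s1[x] == s2[x]:
                continue
            # x differs, so some earlier variable must differ as well.
            if all(s1[y] == s2[y] for y in order[:i]):
                return False
    return True

# Hypothetical guesses consistent with the single equation B = A.
guesses = {
    frozenset(): {"A": 0, "B": 0},
    frozenset({("A", 1)}): {"A": 1, "B": 1},
}
print(compatible(guesses, ["A", "B"]), compatible(guesses, ["B", "A"]))
```

The check runs in time polynomial in the number of guessed solutions and variables, and it never constructs the functions $F_X$ themselves.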


Notes 


The material in Section 5.1 is largely taken from [Halpern and Hitchcock 2013]. References to 
the literature on Bayesian networks are given in the notes to Chapter 2. Plausibility measures 
and conditional plausibility measures were defined by Friedman and Halpern [1995]. The 
notion of algebraic cpm was introduced in [Halpern 2001], where its application to Bayesian 
networks is also discussed. Specifically, it is shown there how the “technology” of Bayesian 
networks applies to representations of uncertainty other than probability measures, as long as the
representation can be viewed as an algebraic cpm. 

Glymour et al. [2010] suggest that trying to understand causality through examples is hope- 
less because the number of models grows exponentially large as the number of variables 
increases, which means that the examples considered in the literature represent an “infinites- 
imal fraction” of the possible examples. (They use much the same combinatorial arguments 
as those outlined at the beginning of the chapter.) They consider and dismiss a number of ap- 
proaches to restricting the number of models. Interestingly, although they consider Bayesian 
networks (and advocate their use), they do not point out that by restricting the number of par- 
ents of a node, we get only polynomial growth rather than exponential growth in the number 
of models. That said, I agree with their point that it does not suffice to just look at examples. 
See Chapter 8 for more discussion of this point. 

Pearl [2000] first discusses causality using causal networks and only then considers defini- 
tions of causality using structural equations. As I mentioned in the text, the causal network just 
gives information regarding the (in)dependencies between variables; the equations in a causal 
model give more information. However, we can augment a (qualitative) Bayesian network 
with what are called conditional probability tables. These are just the probabilistic analogue 
of what was done in Table (5.2) in Example 5.2.1: we give the conditional probability of 
the value of each variable in the Bayesian network given each possible setting of its parents. 
(Thus, for variables that have no parents, we give the unconditional probability of each value.) 
In the deterministic case, these probabilities are always either 0 or 1. In this special case, the 
conditional probability tables can be viewed as defining structural equations; for each variable 
X, they determine the equation $F_X$ characterizing how the value of X depends on the value
of each of the other variables. (Of course, since the value of X is independent of variables 
that are not its parents, we need to define only how the value of X depends on the values 
of its parents.) Thus, a qualitative Bayesian network augmented with conditional probability 
tables (which is called a quantitative Bayesian network in [Halpern 2003]) can be viewed as 
defining a causal model. 

Sipser [2012] gives an excellent introduction to complexity theory, including the fact that
SAT is NP-complete (which was originally proved by Cook [1971]). The polynomial hierarchy
was defined by Stockmeyer [1977]; the complexity class $D^P$ was defined by Papadimitriou
and Yannakakis [1982]; this was generalized to $D^P_k$ by Aleksandrowicz et al. [2014].
Theorem 5.3.1(a) and Theorem 5.3.2(a) were proved by Eiter and Lukasiewicz [2002], using the
fact (which they also proved) that causes in the original HP definition are always singletons;
Theorem 5.3.1(b) was proved by Aleksandrowicz et al. [2014]; Theorem 5.3.1(c) and
Theorem 5.3.2(b) were proved in [Halpern 2015a]. The fact that many large NP-complete problems
can now be solved quite efficiently is discussed in detail by Gomes et al. [2008].

The discussion of axiomatizations is mainly taken from [Halpern 2000]. The axioms con- 
sidered here were introduced by Galles and Pearl [1998]. The technical results in Section 5.4 
were originally proved in [Halpern 2000], although there are some slight modifications here. 
An example is also given there showing that just restricting C5 to the case of k = 1 does not 
suffice to characterize $\mathcal{M}_{rec}(\mathcal{S})$. The technique of using maximal consistent sets used in the
proof of Theorem 5.4.1 is a standard technique in modal logic. A proof of the properties of 
maximal consistent sets used in the proof of Theorem 5.4.1, as well as more discussion of this 
technique, can be found in [Fagin, Halpern, Moses, and Vardi 1995, Chapter 3]. 

Briggs [2012] has extended the language $\mathcal{L}$ to allow disjunctions to appear inside the [ ] (so
that we can write, for example, $[X \gets 1 \lor Y \gets 1](Z = 1)$) and to allow nested counterfactuals
(so that we can write $[X \gets 1]([Y \gets 1](Z = 1))$). She gives semantics to this language in a
way that generalizes that given here and provides a sound and complete axiomatization.


Chapter 6


Responsibility and Blame 


Responsibility is a unique concept ... You may share it with others, but your por- 
tion is not diminished. 


Hyman G. Rickover 


I enjoyed the position I was in as a tennis player. I was to blame when I lost. I 
was to blame when I won. And I really like that, because I played soccer a lot 
too, and I couldn’t stand it when I had to blame it on the goalkeeper. 


Roger Federer 


Up to now I have mainly viewed causality as an all-or-nothing concept. Although that position 
is mitigated somewhat by the notion of graded causality discussed in Chapter 3, and by the 
probabilistic notion of causality discussed in Section 2.5, it is still the case that either A is a 
cause of B or it is not (in a causal setting $(M, \vec{u})$). Graded causality does allow us to think of
some causes as being better than others; with probability, we can take a better cause to be one 
that has a higher probability of being the cause. But we may want to go further. 

Recall the voting scenario from Example 2.3.2, where there are 11 voters. If Suzy wins the
vote 6-5, then all the definitions agree that each voter for Suzy is a cause of Suzy’s victory. Using the
notation from the beginning of Section 2.1, in the causal setting $(M, \vec{u})$ where $V_1 = \cdots = V_6 = 1$
and $V_7 = \cdots = V_{11} = 0$, so each of the first six voters voted for Suzy, and the rest
voted for Billy, each of $V_i = 1$, $i = 1, \ldots, 6$, is a cause of $W = 1$ (Suzy’s victory).

Now consider a context where all 11 voters vote for Suzy. Then according to the original
and updated HP definitions, it is now the case that $V_i = 1$ is a cause of $W = 1$ for
$i = 1, \ldots, 11$. But intuitively, we would like to say that each voter in this case is “less” of a
cause than in the case of the 6-5 victory. Graded causality doesn’t help here, since it does
not allow us to compare degrees of causality across different causal settings; we cannot use
graded causality to say that a voter in the case of a 6-5 win is a “better” cause than a voter in
the case of an 11-0 win.

To some extent, this problem is dealt with by the modified HP definition. In the case of
the 6-5 victory, each voter is still a cause. In the case of the 11-0 victory, each voter is only
part of a cause. As observed in Example 2.3.2, each subset $J$ of six voters is a cause; that is,
$\bigwedge_{i \in J} V_i = 1$ is a cause of $W = 1$. Intuitively, being part of a “large” cause should make a
voter feel less responsible than being part of a “small” cause. In the next section, I formalize
this intuition (for all the variants of the HP definition). This leads to a naive definition of
responsibility, which nevertheless captures intuitions reasonably well.


Like the definition of causality, this definition of responsibility assumes that everything 
relevant about the facts of the world and how the world works is known. Formally, this 
means that the definition is relative to a causal setting $(M, \vec{u})$. But this misses out on an
important component of determining what I will call here blame: the epistemic state. Consider 
a doctor who treats a patient with a particular drug, resulting in the patient’s death. The 
doctor’s treatment is a cause of the patient’s death; indeed, the doctor may well bear degree of 
responsibility 1 for the death. However, if the doctor had no idea that the treatment had adverse
side effects for people with high blood pressure, then he should perhaps not be blamed for the 
death. Actually, in legal arguments, it may not be so relevant what the doctor actually did 
or did not know, but what he should have known. Thus, rather than considering the doctor’s 
actual epistemic state, it may be more important to consider what his epistemic state should 
have been. But, in any case, if we are trying to determine whether the doctor is to blame for 
the patient’s death, we must take into account the doctor’s epistemic state. 


Building on the definition of responsibility, in Section 6.2, I present a definition of blame
that considers whether agent a performing action b is to blame for an outcome $\varphi$. The
definition is relative to an epistemic state for a, which is a set of causal settings, together with
a probability on them. The degree of blame is then essentially the expected degree of
responsibility of action b for $\varphi$. To understand the difference between responsibility and blame,
suppose that there is a firing squad consisting of ten excellent marksmen. Only one of them 
has live bullets in his rifle; the rest have blanks. The marksmen do not know which of them 
has the live bullets. The marksmen shoot at the prisoner and he dies. The only marksman who 
is the cause of the prisoner’s death is the one with the live bullets. That marksman has degree 
of responsibility 1 for the death; all the rest have degree of responsibility 0. However, each of 
the marksmen has degree of blame 1/10. 


The definition of responsibility that I give in Section 6.1 is quite naive, although it does 
surprisingly well at predicting how people ascribe causal responsibility. However, people 
depart from this definition in some systematic ways. For one thing, people seem to take 
normality into account. Thus, to some extent, graded causality is getting at similar intuitions 
as the notion of responsibility. To make matters worse, people seem to confound blame and 
responsibility when ascribing responsibility; in particular, they take into account the prior 
probabilities of the outcome, before the action was performed. All this suggests that there 
may not be a clean, elegant definition of responsibility that completely matches how people 
ascribe responsibility. I discuss this in more detail in Section 6.3. 


Before going on, a few words on the choice of words. Although I believe that the defi- 
nitions of “responsibility” and “blame” presented here are reasonable, they certainly do not 
capture all the connotations of these words as used in the literature or in natural language. In 
the philosophy literature, papers on responsibility typically are concerned with moral respon- 
sibility. The definitions given here, by design, do not take into account intentions or what the 
alternatives were, both of which seem necessary in dealing with moral issues. (The model 
implicitly encodes the existence of alternative actions in the range of variables representing
actions, but the definition of causality does not take this into account beyond requiring the 
existence of at least a second alternative before saying that a particular action is a cause of an 
outcome.) 

For example, there is no question that Truman was in part responsible and to blame for 
the deaths resulting from dropping the atom bombs on Hiroshima and Nagasaki. However, 
to decide whether this is a morally reprehensible act, it is also necessary to consider the 
alternative actions he could have performed and their possible outcomes. The definitions 
given do not address these moral issues, but I believe that they may be helpful in elucidating 
them. 

Although “responsibility” and “blame” as I will define them are distinct notions, these 
words are often used interchangeably in natural language. Phrases and terms like “degree 
of causality”, “degree of culpability”, and “accountability” are also used to describe similar 
notions. I think it is important to tease out the distinctions between notions. However, the 
reader should be careful not to read too much into my usage of words like “responsibility” 
and “blame”. 


6.1 A Naive Definition of Responsibility 


I start by giving a naive definition of responsibility for the original HP definition. I give the 
definition in causal models without a normality ordering. I briefly remark after the definition 
how responsibility can be defined in extended causal models. 


Definition 6.1.1 The degree of responsibility of $X = x$ for $\varphi$ in $(M, \vec{u})$ according to the
original HP definition, denoted $dr^o((M, \vec{u}), (X = x), \varphi)$, is 0 if $X = x$ is not a cause of $\varphi$ in
$(M, \vec{u})$ according to the original HP definition; it is $1/(k + 1)$ if there is a witness $(\vec{W}, \vec{w}, x')$
to $X = x$ being a cause of $\varphi$ in $(M, \vec{u})$ according to the original HP definition with $|\vec{W}| = k$, and
$k$ is minimal, in that there is no witness $(\vec{W}', \vec{w}', x'')$ to $X = x$ being a cause of $\varphi$ in $(M, \vec{u})$
according to the original HP definition such that $|\vec{W}'| < k$.


Of course, if we consider extended causal models, we should restrict to witnesses $(\vec{W}, \vec{w}, x')$
such that $s_{\vec{W} = \vec{w}, X = x', \vec{u}} \succeq s_{\vec{u}}$. (Recall this condition says that the witness world is at least as
normal as the actual world.) Doing this does not change any of the essential features of the 
definition, so I do not consider the normality ordering further in this section. I consider its 
impact more carefully in Section 6.3. 

Roughly speaking, $dr^o((M, \vec{u}), (X = x), \varphi)$ measures the minimal number of changes
that have to be made to the world $s_{\vec{u}}$ in order to make $\varphi$ counterfactually depend on $X$. If
$X = x$ is not a cause of $\varphi$, then no partition of $\mathcal{V}$ into $(\vec{Z}, \vec{W})$ makes $\varphi$ counterfactually depend
on $X = x$, and the minimal number of changes in Definition 6.1.1 is taken to have cardinality
$\infty$; thus, the degree of responsibility of $X = x$ is 0. If $\varphi$ counterfactually depends on $X = x$,
that is, $X = x$ is a but-for cause of $\varphi$ in $(M, \vec{u})$, then the degree of responsibility of $X = x$ for
$\varphi$ is 1. In other cases, the degree of responsibility is strictly between 0 and 1. Thus, $X = x$ is a
cause of $\varphi$ in $(M, \vec{u})$ according to the original HP definition iff $dr^o((M, \vec{u}), (X = x), \varphi) > 0$.




Example 6.1.2 Going back to the voting example, it is immediate that each voter who votes
for Suzy has degree of responsibility 1 in the case of a 6-5 victory. Each voter for Suzy is a
but-for cause of the outcome; changing any of their votes is enough to change the outcome.
However, in the case of an 11-0 victory, each voter has degree of responsibility 1/6. To see
that the degree of responsibility of, say, $V_1 = 1$ for $W = 1$ is 1/6 in the latter case, we could,
for example, take $\vec{W} = \{V_2, \ldots, V_6\}$ and $\vec{w} = \vec{0}$. Clearly, $(\vec{W}, \vec{0}, 0)$ is a witness to $V_1 = 1$
being a cause of $W = 1$. There are many other witnesses, but none has a set $\vec{W}$ of smaller
cardinality.


As this example illustrates, the sum of the degrees of responsibility can be more than 1. 
The degree of responsibility is not a probability! 
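
As a quick check of this remark, using only the degrees of responsibility computed in Example 6.1.2 (the index $i$ ranging over the voters for Suzy is the only notation added here), the sums in the two voting scenarios work out as follows:

$$\sum_{i=1}^{11} dr\big((M, \vec{u}), (V_i = 1), W = 1\big) = 11 \cdot \frac{1}{6} = \frac{11}{6} > 1 \quad \text{(11-0 victory)},$$

$$\sum_{i=1}^{6} dr\big((M, \vec{u}), (V_i = 1), W = 1\big) = 6 \cdot 1 = 6 > 1 \quad \text{(6-5 victory)}.$$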


Example 6.1.3 Consider the forest-fire example yet again. In the conjunctive model, both 
the lightning and the arsonist have degree of responsibility 1 for the fire; changing either one 
results in there being no fire. In the disjunctive model, they both have degree of responsibility 
1/2. Thus, degree of responsibility lets us distinguish the two scenarios even though, according
to the original HP definition, the lightning and the arsonist are causes of the fire in both
scenarios. ∎


The intuition that $|\vec{W}|$ is the minimal number of changes that has to be made in order to make
$\varphi$ counterfactually depend on $X$ is not completely right, as the following example shows.


Example 6.1.4 Consider a model with three endogenous variables, A, B, and C. The value
of A is determined by the context, B takes the same value as A, and C is 0 if A and B agree,
and 1 otherwise. The context $\vec{u}$ is such that A = 0, so B = 0 and C = 0. A = 0 is a cause
of C = 0 according to the original HP definition, with witness ({B}, 0, 1): if we fix B at 0,
then if A = 1, C = 1. It is easy to check that $(\emptyset, \emptyset, 1)$ is not a witness. Thus, the degree of
responsibility of A = 0 is 1/2. However, the value of B did not have to change in order to
make C depend counterfactually on A; rather, B had to be held at its actual value. ∎


Example 6.1.5 In the naive model of the rock-throwing example, Suzy and Billy each have 
degree of responsibility 1/2 for the bottle shattering. In the sophisticated model, Suzy has 
degree of responsibility 1/2 for the outcome, since the minimal witness for ST = 1 being
a cause of BS = 1 consists of BH (or BT). In contrast, Billy has degree of responsibility 0
for the bottle shattering, since his throw was not a cause of the outcome. ∎
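
As a rough illustration of Definition 6.1.1, the following Python sketch searches by brute force for a smallest witness under the original HP definition in a small binary model. The function names (`solve`, `degree_of_responsibility`), the encoding of a model as a list of (variable, equation) pairs in causal order with the context folded into the equations, and the particular way the sophisticated rock-throwing equations are written are all assumptions made for this sketch. The search is exponential and is only meant for toy models like those in Examples 6.1.4 and 6.1.5.

```python
from fractions import Fraction
from itertools import combinations, product


def solve(model, interventions):
    """Evaluate the equations in causal order, overriding intervened variables."""
    values = {}
    for name, equation in model:
        values[name] = interventions[name] if name in interventions else equation(values)
    return values


def degree_of_responsibility(model, X, phi, domain=(0, 1)):
    """Naive degree of responsibility of X's actual value for phi (original HP)."""
    actual = solve(model, {})
    if not phi(actual):
        return Fraction(0)  # phi does not hold, so nothing is a cause of it
    x = actual[X]
    others = [name for name, _ in model if name != X]
    for k in range(len(others) + 1):  # try the smallest witness sets first
        for W in combinations(others, k):
            Z = [name for name in others if name not in W]
            for w in product(domain, repeat=k):
                fixed = dict(zip(W, w))
                # AC2(a): some other value of X makes phi false once W is set to w.
                if all(phi(solve(model, {**fixed, X: x2})) for x2 in domain if x2 != x):
                    continue
                # AC2(b), original version: with X at its actual value and W set to w,
                # phi holds however many variables in Z are frozen at their actual values.
                if all(
                    phi(solve(model, {**fixed, X: x, **{z: actual[z] for z in frozen}}))
                    for r in range(len(Z) + 1)
                    for frozen in combinations(Z, r)
                ):
                    return Fraction(1, k + 1)
    return Fraction(0)


# Example 6.1.4 in the context where A = 0.
example_6_1_4 = [
    ("A", lambda v: 0),
    ("B", lambda v: v["A"]),
    ("C", lambda v: 0 if v["A"] == v["B"] else 1),
]

# Sophisticated rock-throwing model in the context where both throw.
rock_throwing = [
    ("ST", lambda v: 1),
    ("BT", lambda v: 1),
    ("SH", lambda v: v["ST"]),
    ("BH", lambda v: int(v["BT"] == 1 and v["SH"] == 0)),
    ("BS", lambda v: int(v["SH"] == 1 or v["BH"] == 1)),
]

dr_A = degree_of_responsibility(example_6_1_4, "A", lambda v: v["C"] == 0)
dr_suzy = degree_of_responsibility(rock_throwing, "ST", lambda v: v["BS"] == 1)
dr_billy = degree_of_responsibility(rock_throwing, "BT", lambda v: v["BS"] == 1)
```

Under these assumptions, `dr_A` and `dr_suzy` both come out as 1/2 and `dr_billy` as 0, in line with the two examples above.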


These examples also suggest how to define degree of responsibility of $X = x$ for $\varphi$ in
$(M, \vec{u})$ for the modified and updated HP definitions. Now a cause can involve a set of
variables, not just a single variable. Both the size of the cause and the size of the witness matter
for computing the degree of responsibility. 


Definition 6.1.6 The degree of responsibility of $X = x$ for $\varphi$ in $(M, \vec{u})$ according to
the updated (resp., modified) HP definition, denoted $dr^u((M, \vec{u}), (X = x), \varphi)$ (resp.,
$dr^m((M, \vec{u}), (X = x), \varphi)$), is 0 if $X = x$ is not part of a cause of $\varphi$ in $(M, \vec{u})$ according
to the updated (resp., modified) HP definition; it is $1/k$ if there exists a cause $\vec{X} = \vec{x}$ of $\varphi$
and a witness $(\vec{W}, \vec{w}, \vec{x}')$ to $\vec{X} = \vec{x}$ being a cause of $\varphi$ in $(M, \vec{u})$ such that (a) $X = x$ is
a conjunct of $\vec{X} = \vec{x}$, (b) $|\vec{W}| + |\vec{X}| = k$, and (c) $k$ is minimal, in that there is no cause
$\vec{X}_1 = \vec{x}_1$ for $\varphi$ in $(M, \vec{u})$ and witness $(\vec{W}', \vec{w}', \vec{x}_1')$ to $\vec{X}_1 = \vec{x}_1$ being a cause of $\varphi$ in $(M, \vec{u})$
according to the updated (resp., modified) HP definition that includes $X = x$ as a conjunct
with $|\vec{W}'| + |\vec{X}_1| < k$.


It is easy to see that all the definitions of degree of responsibility agree that the degree of
responsibility is 1/6 for each voter in the case of the 11-0 vote; that A = 0 has degree of
responsibility 1/2 for C = 0 in the causal setting described in Example 6.1.4; that Suzy
has degree of responsibility 1/2 for the bottle shattering in both the naive and sophisticated
rock-throwing examples; that the lightning and arsonist both have degree of responsibility 1/2
in the disjunctive forest-fire model; and that they both have degree of responsibility 1 in the
conjunctive forest-fire model. If the distinction between the various HP definitions of causality
is not relevant for the degree of responsibility, I just write dr, omitting the superscript.

These examples already suggest that this definition of responsibility does capture some of 
the way people use the word. But this definition is admittedly somewhat naive. I look at it 
more carefully, and consider refinements of it, in Section 6.3. But first I consider blame. 


6.2 Blame 


The definitions of both causality and responsibility assume that the context and the structural 
equations are given; there is no uncertainty. We are often interested in assigning a degree of 
blame to an action. This assignment depends on the epistemic state of the agent before the 
action was performed. Intuitively, if the agent had no reason to believe, before he performed 
the action, that his action would result in a particular outcome, then he should not be held to 
blame for the outcome (even if in fact his action caused the outcome). 

There are two significant sources of uncertainty for an agent who is contemplating per- 
forming an action: 


• what value various variables have (for example, a doctor may be uncertain about
whether a patient has high blood pressure);


• how the world works (for example, the doctor may be uncertain about the side effects
of a given medication).


In a causal model, the values of variables are determined by the context; “how the world
works” is determined by the structural equations. Thus, I model an agent’s uncertainty by a
pair $(\mathcal{K}, \Pr)$, where $\mathcal{K}$ is a set of causal settings, that is, pairs of the form $(M, \vec{u})$, and $\Pr$ is a
probability distribution over $\mathcal{K}$. If we are trying to assign a degree of blame to $X = x$, then
we assume that $X = x$ holds in all causal settings in $\mathcal{K}$; that is, $(M, \vec{u}) \models X = x$ for all
$(M, \vec{u}) \in \mathcal{K}$. The assumption is that $X = x$ is known; $\mathcal{K}$ describes the uncertainty regarding
other features of the world. We typically think of $\mathcal{K}$ as describing an agent’s uncertainty before
he has actually performed $X = x$, so I do not take $\varphi$ to be known (although in many cases of
interest, it will be); $\mathcal{K}$ could also be taken to describe the agent’s uncertainty after performing
$X = x$ and making observations regarding the effect of this action (which will typically
include $\varphi$). I return to this point below (see Example 6.2.4). In any case, the degree of blame
of $X = x$ for $\varphi$ is just the expected degree of responsibility of $X = x$ for $\varphi$, where the
expectation is taken with respect to $\Pr$. Just as there is a definition of degree of responsibility
corresponding to each variant of the HP definition, there are definitions of degree of blame
corresponding to each variant. Here I ignore these distinctions and just write dr for degree
of responsibility (omitting the superscript) and db for the notion of degree of blame (again,
omitting the corresponding superscript).


Definition 6.2.1 The degree of blame of $X = x$ for $\varphi$ relative to epistemic state $(\mathcal{K}, \Pr)$,
denoted $db(\mathcal{K}, \Pr, X = x, \varphi)$, is


$$\sum_{(M, \vec{u}) \in \mathcal{K}} dr((M, \vec{u}), X = x, \varphi) \, \Pr((M, \vec{u})).$$


Example 6.2.2 Suppose that we are trying to compute the degree of blame of Suzy’s throwing
the rock for the bottle shattering. Suppose that the only causal model that Suzy considers
possible is essentially like that of Figure 2.3, with some minor modifications: BT can now
take on three values, say 0, 1, 2. As before, if BT = 0, then Billy doesn’t throw, and if
BT = 1, then Billy does throw; if BT = 2, then Billy throws extra hard. Assume that the
causal model is such that if BT = 1, then Suzy’s rock will hit the bottle first, but if BT = 2,
then they will hit simultaneously. Thus, SH = 1 if ST = 1, and BH = 1 if BT = 1 and
SH = 0 or if BT = 2. Call this structural model M.
At time 0, Suzy considers the following four causal settings equally likely:


• $(M, \vec{u}_1)$, where $\vec{u}_1$ is such that Billy already threw at time 0 (and hence the bottle is
shattered);


• $(M, \vec{u}_2)$, where the bottle was whole before Suzy’s throw, and Billy throws extra hard,
so Billy’s throw and Suzy’s throw hit the bottle simultaneously (this essentially gives
the model in Figure 2.2);


• $(M, \vec{u}_3)$, where the bottle was whole before Suzy’s throw, and Suzy’s throw hit before
Billy’s throw (this essentially gives the model in Figure 2.3); and


• $(M, \vec{u}_4)$, where the bottle was whole before Suzy’s throw, and Billy did not throw.


The bottle is already shattered in $(M, \vec{u}_1)$ before Suzy’s action, so Suzy’s throw is not a cause
of the bottle shattering, and her degree of responsibility for the shattered bottle is 0. As
discussed earlier, the degree of responsibility of Suzy’s throw for the bottle shattering is 1/2
in both $(M, \vec{u}_2)$ and $(M, \vec{u}_3)$ and 1 in $(M, \vec{u}_4)$. Thus, the degree of blame of ST = 1 for
BS = 1 is $\frac{1}{4} \cdot \frac{1}{2} + \frac{1}{4} \cdot \frac{1}{2} + \frac{1}{4} \cdot 1 = \frac{1}{2}$. ∎
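
The expectation in Definition 6.2.1 can be spelled out for this example with a short Python sketch. The function name `degree_of_blame` and the representation of an epistemic state as a list of (probability, degree of responsibility) pairs are assumptions of the sketch; the probabilities and degrees of responsibility are the ones stated in the example, including the 0 contributed by the setting where the bottle is already shattered.

```python
from fractions import Fraction


def degree_of_blame(epistemic_state):
    """Expected degree of responsibility over the settings considered possible.

    epistemic_state: list of (probability, degree_of_responsibility) pairs,
    one pair per causal setting.
    """
    if sum(p for p, _ in epistemic_state) != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(p * dr for p, dr in epistemic_state)


quarter = Fraction(1, 4)
suzy_epistemic_state = [
    (quarter, Fraction(0)),     # u1: bottle already shattered, Suzy is not a cause
    (quarter, Fraction(1, 2)),  # u2: simultaneous hit
    (quarter, Fraction(1, 2)),  # u3: Suzy's rock hits first
    (quarter, Fraction(1)),     # u4: Billy does not throw
]
suzy_blame = degree_of_blame(suzy_epistemic_state)  # Fraction(1, 2)
```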


Example 6.2.3 Consider again the example of the firing squad with ten excellent marksmen.
Suppose that marksman 1 knows that exactly one marksman has a live bullet in his rifle and
that all the marksmen will shoot. Thus, he considers 10 contexts possible, depending on who
has the bullet. Let $p_i$ be his prior probability that marksman $i$ has the live bullet. Then the
degree of blame (according to marksman 1) of the $i$th marksman’s shot for the death is $p_i$.
The degree of responsibility is either 1 or 0, depending on whether marksman $i$ actually had
the live bullet. Thus, it is possible for the degree of responsibility to be 1 and the degree of
blame to be 0 (if marksman 1 mistakenly ascribes probability 0 to his having the live bullet,
when in fact he does), and it is possible for the degree of responsibility to be 0 and the degree
of blame to be 1 (if he mistakenly ascribes probability 1 to his having the bullet when in fact
he does not). ∎
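
The gap between responsibility and blame in this example can be made concrete with a small Python sketch. The function names and the dictionary encoding of the prior (marksman number mapped to the probability that he holds the live bullet) are assumptions of the sketch, and the "mistaken" prior below is a hypothetical one chosen to match the first case described in the example.

```python
from fractions import Fraction


def responsibility(shooter, live):
    """Degree of responsibility of a marksman's shot, given who holds the live bullet."""
    return 1 if shooter == live else 0


def blame(shooter, prior):
    """Expected responsibility of the shooter under a prior over who holds the bullet."""
    return sum(p * responsibility(shooter, live) for live, p in prior.items())


# Uniform prior over the ten marksmen: every marksman gets degree of blame 1/10.
uniform = {i: Fraction(1, 10) for i in range(1, 11)}

# Hypothetical mistaken prior: marksman 1 is sure he does not hold the live bullet.
mistaken = {1: Fraction(0), **{i: Fraction(1, 9) for i in range(2, 11)}}
```

If marksman 1 in fact holds the live bullet, `responsibility(1, live=1)` is 1 while `blame(1, mistaken)` is 0, which is the first mismatch the example describes.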


As I said earlier, typically the probability Pr on the set K of causal settings can be thought 
of as being the agent’s prior probability, before observing the consequences of having X = a. 
But it makes perfect sense to take other choices for Pr. For example, it could be the agent’s 
posterior probability, after making observations. In legal settings, it might be important to 
consider not the agent’s actual prior (or posterior) probability—that is, the agent’s actual epis- 
temic state—but what the agent’s probability should have been. The definition of degree of 
blame does not change, however Pr is interpreted; it just takes the probability Pr as an in- 
put. But the choice can clearly affect the degree of blame computed. The following example 
illustrates some of the key differences. 


Example 6.2.4 Consider a patient who dies as a result of being treated by a doctor with a 
particular drug. Assume that the patient died due to the drug’s adverse side effects on people 
with high blood pressure and, for simplicity, that this was the only cause of death. Suppose 
that the doctor was not aware of the drug’s adverse side effects. (Formally, this means that he 
does not consider possible a causal model where taking the drug causes death.) Then, relative 
to the doctor’s actual epistemic state, the doctor’s degree of blame will be 0. The key point 
here is that the doctor will ascribe high probability to causal settings in $\mathcal{K}$ where he treats the
patient and the patient does not die. However, a lawyer might argue in court that the doctor 
should have known that this treatment had adverse side effects for patients with high blood 
pressure (because this fact is well documented in the literature) and thus should have checked 
the patient’s blood pressure. If the doctor had performed this test, then he would of course 
have known that the patient had high blood pressure. With respect to the resulting epistemic 
state, the doctor’s degree of blame for the death is quite high. Of course, the lawyer’s job is 
to convince the court that the latter epistemic state is the appropriate one to consider when 
assigning degree of blame. 

In any case, the doctor’s actual epistemic state and the epistemic state he arguably should 
have had before observing the effects of the drug might both be different from his epistemic 
state after treating the patient with the drug and observing that he dies. Not knowing the 
literature, the doctor may still consider it possible that the patient died for reasons other than 
the treatment, but will consider causal models where the treatment was a cause of death more 
likely. Thus, the doctor will likely have a higher degree of blame relative to his epistemic state 
after the treatment. ∎


Interestingly, all three epistemic states (the epistemic state that an agent actually has before 
performing an action, the epistemic state that the agent should have had before performing 
the action, and the epistemic state after performing the action) have been considered relevant 
to determining responsibility according to different legal theories. Thinking in terms of the 
relevant epistemic state also helps clarify some earlier examples that seemed problematic. 


Example 6.2.5 Recall Example 2.3.8, where company A dumps 100 kilograms of pollutant,
company B dumps 60 kilograms, and biologists determine that $k$ kilograms of pollutant
suffice for the fish to die. The “problematic” case was when $k = 80$. In this case, only A is a
cause of the fish dying according to the modified HP definition. A is also a cause according
to the original and updated HP definitions; whether B is a cause depends on whether A’s only
choices are to dump either 0 or 100 kilograms of pollutant (in which case B is not a cause) or
whether A can dump some intermediate amount between 20 and 79 kilograms (in which case
B is a cause).

Although it may not be clear what the “right” answer is here, it does seem disconcerting
that B escapes responsibility if we use the modified HP definition, and that B’s degree of
responsibility should depend so heavily on the amount of pollutant that A could have dumped
in the case of the original and updated definitions. Thinking in terms of blame gives what seem
to be more reasonable answers here. When trying to decide how blameworthy B’s action is,
we typically do not think that B will know exactly how much pollutant A will dump, nor that
B will know how much pollutant is needed to kill fish. A causal model for B can thus be
characterized by two parameters: the amount $a$ of pollutant that A dumps and the amount $k$
of pollutant that is needed to kill the fish. We can now consider B’s degree of responsibility
in a causal model as a function of $a$ and $k$:


• If $a < k \le a + 60$, B’s dumping 60 kilograms of pollutant is a but-for cause and thus has
degree of responsibility 1 according to all variants of the HP definition.


• If $k > a + 60$, then B’s action is not a cause and has degree of responsibility 0 according
to all variants of the HP definition.


• If $k \le a$ and $k \le 60$, then B has degree of responsibility 1/2 according to all variants
of the HP definition.


• If $k \le a$ and $k > 60$, then B has degree of responsibility 0 according to all variants of
the HP definition if A can dump only 0 or 100 kilograms of pollutant; if A can dump
between $k - 60$ and $k$ kilograms, then B still has degree of responsibility 0 according to
the modified HP definition but has degree of responsibility 1/2 according to the original
and updated definitions.
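
The case analysis above can be written as a small function of $a$ and $k$, and a degree of blame then follows by averaging over what B considers possible. The Python sketch below is only illustrative: the function names, the `variant` and `a_has_intermediate_options` parameters, and the five-point uniform belief about $k$ are all hypothetical choices, and the boundary cases are read with non-strict inequalities so that the four cases cover every pair $(a, k)$.

```python
from fractions import Fraction

B_AMOUNT = 60  # kilograms dumped by company B


def responsibility_of_b(a, k, variant="modified", a_has_intermediate_options=False):
    """B's degree of responsibility when A dumps a kilograms and k kilograms are lethal."""
    if k > a + B_AMOUNT:
        return Fraction(0)      # the fish survive, so B is not a cause
    if a < k:
        return Fraction(1)      # a < k <= a + 60: B is a but-for cause
    if k <= B_AMOUNT:
        return Fraction(1, 2)   # either company alone would have sufficed
    # Remaining case: k <= a and k > 60.
    if variant != "modified" and a_has_intermediate_options:
        return Fraction(1, 2)   # original/updated definitions, A could dump less
    return Fraction(0)


def blame_of_b(beliefs, **options):
    """Expected responsibility of B; beliefs maps (a, k) pairs to probabilities."""
    return sum(p * responsibility_of_b(a, k, **options) for (a, k), p in beliefs.items())


# Hypothetical epistemic state: B knows a = 100 but is unsure about k.
beliefs = {(100, k): Fraction(1, 5) for k in (50, 80, 120, 150, 170)}

blame_modified = blame_of_b(beliefs)
blame_original = blame_of_b(beliefs, variant="original", a_has_intermediate_options=True)
```

With these made-up beliefs, B's degree of blame is 1/2 under the modified definition and 3/5 under the original one (when A has intermediate options), so it is positive in both cases even though B's responsibility is 0 at $k = 80$ under the modified definition.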


The point here is that with reasonable assumptions about what B’s state of mind should have
been, no matter which variant of the HP definition we use, B gets a positive degree of blame.
The degree of blame is likely to depend on factors like how close $k$ is to 60 and the difference
between $a$ and $k$; that is, how close B was to making a difference. This indeed seems to track
how juries might assign blame. Moreover, if company B could convince the jury that, before 
dumping pollutant, it had hard evidence that company A had already dumped more than k 
kilograms of pollutant (so that B was certain that its act would make no difference), then 
there is a good chance that the jury would ascribe less blame to B. (Of course, a jury may still 
want to punish B for the deterrent value of the punishment, but that is a matter orthogonal to 
the notion of blame that is being captured here.) 

Similar considerations show that, under minimal assumptions, if the amount of pollutant 
dumped by A and B is known, but there is uncertainty about k, then A’s degree of blame will 
be higher than that of B. This again seems to accord with intuition. ∎


Example 6.2.6 Voting has often been viewed as irrational. After all, the probability that one
individual’s vote can affect the outcome is typically extremely small. Indeed, the probability 
that an individual’s vote can affect the outcome of the U.S. presidential election has been 
calculated to be roughly $10^{-8}$. (Of course, if almost everyone decides not to vote on the
grounds that their vote is extremely unlikely to affect the outcome, then the few people who 
actually vote will have a major impact on the outcome.) 


In any case, people clearly do vote. There has been a great deal of work trying to explain 
why people vote and defining reasonable utilities that people might have that would make 
voting rational. One approach that I find compelling is that people feel a certain degree of 
responsibility for the outcome. Indeed, in a 2-candidate race where A gets a votes and B gets 
b votes, with a > b, assuming that it requires a majority to win, all the approaches agree that a 
voter for A has degree of responsibility $1/\lceil (a-b)/2 \rceil$ for the outcome (where $\lceil x \rceil$ is the smallest 
integer greater than or equal to x, so that, for example, $\lceil 5/2 \rceil = 3$ and $\lceil 3 \rceil = 3$). Thus, in 
an 11-0 victory, each voter has degree of responsibility 1/6, since $\lceil 11/2 \rceil = 6$, and in a 6-5 
victory, each voter has degree of responsibility 1. Each voter can then carry out an analysis 
in the spirit of Example 6.2.5 to compute his degree of blame for the outcome. The argument 
goes that he should then vote because he bears some degree of responsibility/blame for the 
outcome. 
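
As a quick check on the arithmetic, the following Python sketch computes the degree of responsibility $1/\lceil (a-b)/2 \rceil$ of a voter for the winner. The function name `voter_responsibility` is hypothetical, and the code assumes, as in the example, a two-candidate race decided by majority in which every vote counts equally.

```python
from fractions import Fraction


def voter_responsibility(a: int, b: int) -> Fraction:
    """Degree of responsibility of one voter for A when A wins a votes to b."""
    if a <= b:
        raise ValueError("expected a > b (A is the winner)")
    # ceil((a - b) / 2), computed with integers to avoid floating point
    half_margin = (a - b + 1) // 2
    return Fraction(1, half_margin)


print(voter_responsibility(11, 0))  # the 11-0 victory from the example
print(voter_responsibility(6, 5))   # the 6-5 victory from the example
```

Exact fractions are used so that the results can be compared directly with the values 1/6 and 1 given in the example.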


Of course, it then seems reasonable to ask how that degree of responsibility and blame 
change depending on whether he votes. There is an intuition that someone who abstains 
should be less responsible for an outcome than someone who votes for it. Although certainly 
someone who abstains can never be more responsible for an outcome than someone who votes 
for it, they may be equally responsible. For example, if we take ties to be undecided and Suzy 
wins a 6-5 vote against Billy where there is one abstention, then the abstainer and each of the 
six voters for Suzy are but-for causes of the outcome and thus have degree of responsibility 
1. If the vote is 7-5 for Suzy with one abstention, then again each of the voters for Suzy has 
degree of responsibility 1 for all variants of the HP definition, since they are but-for causes 
of the outcome. However, the abstainer has degree of responsibility 1/2: if a voter for Suzy 
abstains, then the abstainer’s vote becomes critical, since he can make the outcome a draw by 
voting for Billy. 


More generally, if a voters vote for Suzy, b voters vote for Billy, c voters abstain, and a > b, 
then, as we have seen, the degree of responsibility of one of Suzy’s voters for the outcome is 
$1/\lceil (a-b)/2 \rceil$. If a − b is odd, this is also the degree of responsibility of an abstainer. However, 
if a − b is even and c ≥ 2, then the degree of responsibility of an abstainer is slightly lower: 
$1/(\lceil (a-b)/2 \rceil + 1)$. Finally, if a − b is even and c = 1, then the abstainer has degree of responsibility 
0 for all variants of the HP definition. To understand why this should be so, suppose that the 
vote is 9-5 with 2 abstentions. Then if we switch one of the voters for Suzy to voting for Billy 
and switch one abstainer to voting for Billy, then the vote becomes 8-7; now if the second 
abstainer votes for Billy, Suzy is no longer the winner. Thus, the second abstainer has degree 
of responsibility 1/3 for Suzy’s victory. However, if there is only one abstention, there are no 
vote flips that would make the abstainer critical; the abstainer is not a cause of the outcome. 
This argument applies to all the variants of the HP definition. 

We can “smooth away” this dependence on the parity of a − b and on whether c = 1 or 
c > 1 by considering the degree of blame. Under minimal assumptions, the degree of blame 
of an abstainer will be (slightly) less than that of a voter for the victor. 

Is the increase in degree of blame for the outcome (here perhaps “responsibility” is a better 
word) enough to get someone to vote? Perhaps people don’t consider the increase in degree 
of blame/responsibility when deciding whether to vote; rather, they say (correctly) that they 
feel responsible for the outcome if they vote and believe that that’s a good thing. The impact 
of responsibility on voting behavior deserves further investigation. 


[Figure 6.1 shows a causal network over the variables R, F, LI, and EU (top row) and M, RI, SD, LL, and ES (bottom row).] 

Figure 6.1: Attenuation in a causal chain. 


Example 6.2.7 Recall the causal chain example from Section 3.4.4. The causal network for 
this example is reproduced in Figure 6.1. Although both M = 1 and LL = 1 are causes of 
ES = 1 according to all the variants of the HP definition (indeed, they are but-for causes), 
taking graded causality into account, the original and updated HP definitions captured the 
attenuation of causality along the chain. A cause close to the outcome, like LL = 1, is viewed 
as being a better cause than one further away, like M = 1. The modified HP definition was 
not able to obtain this result. 

However, once we take blame into account and make reasonable assumptions about an 
agent’s probability distribution on causal models, all the definitions again agree that LL = 1 
is more to blame for the outcome than M = 1. Recall that in the actual context in this 
example, M = R = F = LI = EU = 1, so all the variables are 1. However, the equations 
are such that if any of these variables is 0, then ES = 0. Now suppose that an agent is 
uncertain as to the values of these variables. That is, although the agent knows the causal 
model, he is uncertain about the context. For all variants of the HP definition, LL = 1 is a 
cause (and hence has degree of responsibility 1) in all causal settings where EU = 1, and 
otherwise is not a cause. Similarly, M = 1 is a cause and has degree of responsibility 1 in the 
causal setting where R = F = LI = EU = 1, and otherwise is not a cause. Again, under 
minimal assumptions, the degree of blame of M = 1 for ES = 1 is strictly less than that of 
LL = 1 (and, more generally, attenuates as we go along the chain). 
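
The comparison between M = 1 and LL = 1 can be illustrated with a small computation. The Python sketch below is only an illustration under assumptions that are not in the text: the agent knows that M = 1; each of R, F, LI, and EU is independently 1 with the same probability p; each link in the chain is the conjunction of its predecessor and its enabling variable; and a cause is identified by a simple but-for test. The names `chain` and `blame` are hypothetical.

```python
from itertools import product


def chain(m, r, f, li, eu, ll_override=None):
    """Assumed equations for the chain; returns the values (LL, ES).

    ll_override, if given, models an intervention that sets LL directly.
    """
    ri = m and r
    sd = ri and f
    ll = (sd and li) if ll_override is None else ll_override
    es = ll and eu
    return int(ll), int(es)


def blame(target, p):
    """Expected degree of responsibility of target = 1 for ES = 1.

    target is "M" or "LL". Only contexts in which the target event holds
    are considered, and their probabilities are renormalised.
    """
    total = weight = 0.0
    for r, f, li, eu in product((0, 1), repeat=4):
        pr = 1.0
        for value in (r, f, li, eu):
            pr *= p if value else 1 - p
        ll, es = chain(1, r, f, li, eu)
        if target == "LL" and not ll:
            continue  # LL = 1 does not hold in this context
        weight += pr
        if not es:
            continue  # ES = 1 does not hold, so there is nothing to cause
        # But-for test: set the target to 0 and recompute ES.
        if target == "M":
            _, es_cf = chain(0, r, f, li, eu)
        else:
            _, es_cf = chain(1, r, f, li, eu, ll_override=0)
        if es_cf == 0:
            total += pr  # but-for cause: degree of responsibility 1
    return total / weight


print(blame("M", 0.9), blame("LL", 0.9))
```

Under these assumptions the blame of M = 1 is the probability that all four enabling variables are 1, whereas the blame of LL = 1 is just the probability that EU = 1, so the former is smaller whenever p is strictly between 0 and 1.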


Example 6.2.8 Consider the classic bystander effect: A victim is attacked and is in need of 
help. The probability that someone helps is inversely related to the number of bystanders: 
the greater the number of bystanders, the less likely one of them is to help. The bystander 
effect has often been explained in terms of “diffusion of responsibility” and the degree of 
responsibility that a bystander feels. 

Suppose for simplicity that as long as one bystander gets involved, the victim will be 
saved; otherwise the victim will die. That means that if no one gets involved, each bystander 
is a cause of the victim’s death and has degree of responsibility 1. This seems inconsistent 
with the intuition that the more bystanders there are, the less responsibility a bystander feels. 
It actually seems that the notion of blame as I have defined it is capturing the intuition here, 
rather than the notion of responsibility. (Recall my earlier comment that English tends to use 
these words interchangeably.) 

For suppose that we assume that a bystander sees n other bystanders and is deciding 
whether to get involved. Further suppose (as seems quite plausible) that there is a signifi- 
cant cost to being involved in terms of time, so, all things being equal, the bystander would 
prefer not to be involved. However, the bystander does not want to get blamed for the out- 
come. Of course, if he gets involved, then he will not be blamed. If he does not get involved, 
then his degree of blame will depend on what others do. The lower his perceived blame, the 
less likely the bystander is to get involved. Thus, the bystander must compute his degree of 
blame if he does not get involved. Assume that which other bystanders will get involved is 
determined by the context. Note that a bystander who gets involved has degree of responsibility 
1/(k + 1) for the victim surviving in a causal setting where k other bystanders get involved; 
a bystander who does not get involved has degree of responsibility 0 for the victim surviving. 
Getting an accurate model of how many other bystanders will get involved is, of course, rather 
complicated (especially since other bystanders might be thinking along similar lines). But for 
our purposes, it suffices that bystanders assume that the more bystanders there are, for each 
k > 0, the more likely it is that at least k bystanders will get involved. With this assumption, 
the bystander’s degree of blame decreases the more bystanders there are. If he uses the rule 
of getting involved only if his degree of blame (i.e., expected degree of responsibility) for 
the victim’s survival is greater than some threshold, and uses this naive analysis to compute 
degree of blame (which does not seem so unreasonable in the heat of the moment), then the 
more bystanders there are, the less likely he is to get involved. 
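
To see how the expected degree of responsibility falls as the number of bystanders grows, here is a minimal Python sketch. It assumes (an illustrative assumption, not part of the example) that each of the n other bystanders gets involved independently with the same probability q, so that the number k who get involved is binomially distributed. The function name `expected_responsibility` is hypothetical.

```python
from math import comb


def expected_responsibility(n: int, q: float) -> float:
    """Expected value of 1/(k + 1), where k ~ Binomial(n, q) is the number
    of other bystanders who get involved."""
    return sum(
        comb(n, k) * q**k * (1 - q) ** (n - k) / (k + 1)
        for k in range(n + 1)
    )


for n in (0, 1, 5, 20):
    print(n, expected_responsibility(n, 0.5))
```

The independence assumption is one simple way of satisfying the requirement that, for each k > 0, more bystanders make it more likely that at least k of them get involved; other distributions with that property would show the same downward trend.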


6.3. Responsibility, Normality, and Blame 


The definition of responsibility given in Section 6.1 is rather naive. Nevertheless, experiments 
have shown that it does give qualitatively reasonable results in many cases. For example, when 
asked to judge the degree of responsibility of a voter for a victory, people give each voter lower 
and lower responsibility as the outcome goes from 1-0 to 4-0 (although it is certainly not the 
case that they ascribe degree of responsibility 1 in the case of a 2-0 and 1/2 in the case of a 
3-0 victory, as the naive definition would suggest). However, this definition does not capture 
all aspects of how people attribute responsibility. In this section, I discuss some systematic 
ways in which people’s responsibility attributions deviate from this definition and suggest 
some improvements to the definition that take them into account. 

One problem with the naive definition of responsibility is that it is clearly language depen- 
dent, as the following example shows. 


Example 6.3.1 Consider a vote where there are two large blocs: bloc A controls 25 votes 
and bloc B controls 25 votes. As the name suggests, all the voters in a bloc vote together, 
although in principle, they could vote differently. There is also one independent voter; call her 
C. A measure passes unanimously, by a vote of 51-0. What is the degree of responsibility of 
C for the outcome? If the only variables in the model are A, B, C, and O (for outcome), then 
it is 1/2; we need to change either A or B’s vote to make C’s vote critical. But this misses 
out on the fact that A and B each made a more significant contribution to the outcome than 
C. If instead we use variables $A_1, \ldots, A_{25}, B_1, \ldots, B_{25}$, where $A_i = 1$ if the ith voter in 
bloc A votes in favor, and similarly for $B_i$, then C’s degree of responsibility drops to 1/26. 
Intuitively, changing bloc A from voting for the outcome to voting against is a much “bigger” 
change than changing C from voting for the outcome to voting against. The current definition 
does not reflect that (although I suspect people’s responsibility judgments do). 
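
The dependence on the choice of variables can be made concrete with a short Python sketch. It uses a simplified reading of the naive definition: every unit (a bloc or an individual) votes in favor, the measure needs a strict majority, and the degree of responsibility of a unit is 1/(n + 1), where n is the smallest number of other units whose votes must be flipped to make it critical. The function name `responsibility` and the dictionary encoding of the two models are hypothetical.

```python
from fractions import Fraction


def responsibility(votes, target):
    """votes maps each unit to the number of votes it controls."""
    total = sum(votes.values())
    need = total // 2 + 1  # votes required for a strict majority
    # fewest[s] = fewest other units whose combined votes equal s
    fewest = {0: 0}
    for unit, w in votes.items():
        if unit == target:
            continue
        for s, n in list(fewest.items()):
            if n + 1 < fewest.get(s + w, float("inf")):
                fewest[s + w] = n + 1
    # After flipping s votes the measure must still pass, and must fail
    # once the target's votes are flipped as well.
    options = [
        n for s, n in fewest.items()
        if total - s >= need and total - s - votes[target] < need
    ]
    return Fraction(1, min(options) + 1) if options else Fraction(0)


blocs = {"A": 25, "B": 25, "C": 1}
individuals = {f"A{i}": 1 for i in range(1, 26)}
individuals.update({f"B{i}": 1 for i in range(1, 26)})
individuals["C"] = 1

print(responsibility(blocs, "C"), responsibility(individuals, "C"))
```

The same vote thus yields two different degrees of responsibility for C, depending only on whether the blocs are modeled as single variables or as 25 variables each.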


Example 3.5.1 captures a related phenomenon. 


Example 6.3.2 Consider Example 3.5.1 again. Recall that, in this example, there is a team 
consisting of Alice, Bob, Chuck, and Dan. In order to compete in the International Salsa 
Tournament, the team must have at least one male and one female member. All four of the 
team members are supposed to show up for the competition, but in fact none of them does. 
To what extent is each of the team members responsible for the fact that the team could not 
compete? 

The naive definition of responsibility says that all team members have degree of respon- 
sibility 1/2. For example, if Alice shows up, then each of Bob, Chuck, and Dan becomes 
critical. However, there is an intuition that Alice is more responsible for the outcome than any 
of Bob, Chuck, or Dan. Intuitively, there are three ways for Alice to be critical: if any of Bob, 
Chuck, or Dan show up. But there is only one way for Dan to be critical: if Alice shows up. 
Similarly for Bob and Chuck. 
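
The asymmetry between Alice and the others can be checked by enumeration. The Python sketch below treats a witness, in a simplified way, as a smallest set of other team members whose showing up would make a given member critical; this is an illustrative simplification rather than the full HP definition. The names `TEAM`, `can_compete`, and `minimal_witnesses` are hypothetical.

```python
from itertools import combinations

TEAM = {"Alice": "F", "Bob": "M", "Chuck": "M", "Dan": "M"}


def can_compete(present):
    """The team competes only if a male and a female member both show up."""
    return {"F", "M"} <= {TEAM[p] for p in present}


def minimal_witnesses(member):
    """Smallest sets of other members whose showing up makes `member`
    critical, starting from the actual situation where nobody shows up."""
    others = [p for p in TEAM if p != member]
    for size in range(len(others) + 1):
        found = [
            set(group)
            for group in combinations(others, size)
            if not can_compete(set(group))
            and can_compete(set(group) | {member})
        ]
        if found:
            return found
    return []


for person in TEAM:
    print(person, minimal_witnesses(person))
```

Every member has witness sets of size 1, which gives the naive degree of responsibility 1/2 for all four, but Alice has three such sets while each of Bob, Chuck, and Dan has only one.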


Recall that in Chapter 3, the alternative definition of normality took into account the fact 
that there were more witnesses for Alice being a cause of the team not being able to participate 
than for each of Bob, Chuck, or Dan, and ended up grading Alice a “better” cause. It is not 
hard to modify Definitions 6.1.1 and 6.1.6 to also take this into account. For example, in 
Definition 6.1.1, we could count not just the size of the minimal witness set but the number of 
witness sets of that size. Indeed, we could take into account how many witness sets there are 
of each size. (See the notes at the end of the chapter for an explicit approach to doing this.) 

This example also suggests a connection between responsibility and normality. Thinking 
in terms of normality could help explain why changing the votes of an entire bloc of voters in 
Example 6.3.1 is viewed as having a bigger impact than just changing the vote of one voter: 
we can think of changing the votes of an entire bloc of voters as being a more abnormal change 
than just changing the vote of a single voter. 

The connection between normality and responsibility seems to run deep. The next example 
provides another illustration of the phenomenon. 


Example 6.3.3 Suppose that there are five people on a congressional committee. Three votes 
are required for a measure to pass the committee. A, B, and C vote against the measure, 
while D and E vote for it, so the measure fails. Each of A, B, and C is a cause of the measure 
failing, and thus has degree of responsibility 1. But now suppose that A is a Democrat, 
whereas B and C are Republicans. Moreover, the Democratic party favors the measure and 
the Republican party is opposed to it. Does that change things? For most people, it does. 
Going back to graded causality, A would be viewed as a “better” cause of the outcome than 
B because the witness world where A votes for the measure is more normal than the world 
where B votes for the measure. Sure enough, people do assign A more responsibility for 
the outcome than either B or C. This suggests that people do seem to be taking normality 
considerations into account when ascribing degree of responsibility. 


Gerstenberg, Halpern, and Tenenbaum did an extensive series of experiments on voting 
behavior of the type discussed in Example 6.3.3. The number of committee members was 
always either 3 or 5. They varied the number of votes required for the measure to pass, the 
party affiliations of the committee members, and whether the parties supported the measure 
or opposed it. As expected, pivotality considerations, that is, how far a voter was from being 
critical to the outcome, in the sense captured by Definitions 6.1.1 and 6.1.6, were highly 
correlated with the outcome. Normality considerations also affected the outcome. The normality 
considerations came in two flavors. First, voting against your party was considered abnormal, 
as suggested in Example 6.3.3. But this does not completely explain the observations. In the 
causal setting discussed in Example 6.3.3, when all voters should have responsibility 1 
according to the definition, in fact, the average degree of responsibility ascribed (averaged over 
subjects in the experiment) was about .75 to voter A and .5 to voter B. 


We can get some insight into this phenomenon if we consider votes where party consid- 
erations do not apply. For example, suppose that there are three committee members, all in 
the same party, all three vote for a measure, and unanimity is needed for the measure to pass. 
This means that all three voters are but-for causes of the outcome and should have degree of 
responsibility 1 according to the naive definition. But again, subjects only assign them an 
average degree of responsibility of .75. Interestingly, if only one vote is needed for the 
measure to pass and only one committee member votes for it, then that committee member gets 
an average degree of responsibility very close to 1. So somehow responsibility is being diffused 
even when everyone who voted for the measure is a but-for cause. 

One way of understanding this is by considering a different normality criterion. Note that 
if only one vote is needed for the measure to pass, there are three voters, and the vote is 2-1 
against, then the voter who voted in favor is “bucking the trend” and acting more abnormally 
than in the case where all three voters voted for the measure and three votes are needed to 
pass it. Thus, the difference between the two assignments of responsibility can be understood 
in terms of normality considerations if we identify degree of normality with “degree of going 
along with the crowd”. 

Taking normality into account also lets us account for another phenomenon that we see 
with responsibility attributions: they typically attenuate over a long chain. Recall the example 
from Section 3.4.4, where a lit match aboard a ship caused a cask of rum to ignite, causing 
the ship to burn, which resulted in a large financial loss by Lloyd’s insurance, leading to the 
suicide of a financially ruined insurance executive. Although each of the events along the 
causal path from the lit match to the suicide is a (but-for) cause of the suicide, as we saw 
in Section 3.4.4, a reasonable normality ordering results in witnesses for causes further up 
the chain being viewed as less normal than the witnesses for causes closer to the suicide, 
so the graded notion of normality deals with this well, at least for the original and 
updated definitions. By incorporating these normality considerations into the notion of 
responsibility, responsibility will also attenuate along causal chains, at least with the original 
and updated definitions. (I discuss another approach to dealing with this problem, based on the 
observations in Example 6.2.7, shortly.) 

I now sketch one way of incorporating normality considerations more formally into the 
notion of degree of responsibility. The idea is to consider the normality of the changes required 
to move from the actual world to a witness world. Going back to the naive definition of 
responsibility, suppose that we start in a context $\vec{u}$. Changing the value of a variable from its 
value in $\vec{u}$ or holding a variable fixed at a value inconsistent with other changes is, all else 
being equal, abnormal, but not all changes or clampings of variables are equally abnormal. 
Changing the votes of a large bloc of voters is more abnormal, in general, than changing the 
vote of one voter; changing a vote so that it is more aligned with the majority is less abnormal 
than changing it to be less aligned with the majority; changing the vote of a voter so that 
it is more aligned with his party’s views is more normal than changing it so that it is less 
aligned with his party’s view; and so on. We would expect the degree of responsibility to be 
inversely related to the “difficulty” of getting from the actual world to the witness, where the 
difficulty depends on both the “distance” (i.e., how many changes have to be made to convert 
the actual world to the witness world—that is what is being measured by the naive definition 
of responsibility) and how normal these changes are—the larger the increase in normality 
(or the smaller the decrease in normality), the shorter the distance, and hence the greater the 
responsibility. 

An important point here is that when we judge the abnormality of a change, we do so from 
the perspective of the event whose responsibility we are judging. The following example 
should help clarify this. 


Example 6.3.4 Recall that in Example 3.5.2, A flips a switch, followed by B (who knows 
A’s choice). If they both flip the switch in the same direction, C gets a shock. In the actual 
context, B wants to shock C, so when A flips the switch to the right, so does B, and C gets a 
shock. We now want to consider the extent to which A and B are responsible for C’s shock. 
In the witness showing that A is responsible, A’s flip has to change from right to left and B’s 
flip has to be held constant. Given that B wants to shock C, holding B’s flip constant is quite 
abnormal. Now it is true that wanting to shock someone is abnormal from the point of view of 
societal norms, but since A doesn’t want to shock C, this norm is not relevant when judging 
A’s responsibility. By way of contrast, society’s norms are quite relevant when it comes to 
judging B’s responsibility. For B to be responsible, the witness world is one where B’s flip 
changes (and A’s flip stays the same as it is in the actual context). Although it is abnormal to 
change B’s flip in the sense that doing so goes against B’s intent, it makes for a more normal 
outcome according to society’s norms. The net effect is to make B’s change not so large. The 
move from the actual world to the witness world for A being a cause results in a significant 
decrease in normality; the move from the actual world to the witness world for B being a 
cause results in a much smaller decrease in normality (or perhaps even an increase). Thus, 
we judge B to be more responsible. Similar considerations apply when considering Billy’s 
medical condition (see the discussion in Example 3.5.2). 

Note that these considerations are similar to those used when applying ideas of normality 
to judge causality, provided that we do not just look at the normality of the witness worlds 
involved but rather consider the normality of the changes involved in moving from the actual 
world to the witness world, relative to the event we are interested in. 

One final confounding factor when it comes to responsibility judgments is that, as I said 
earlier, people seem to conflate responsibility and blame, as I have defined them. Recall the 
discussion of the bystander effect in Example 6.2.8. Suppose people are asked to what degree 
a bystander is responsible for the victim’s death if there are n bystanders, none of whom 
acted. We would expect that the larger n is, the lower the degree of responsibility assigned to 
each bystander. Part of this can be explained by the normality effect just discussed. The more 
bystanders there are doing nothing, the more abnormal it is for someone to act (in some sense 
of the word “abnormal”), so the lower the degree of responsibility for each one. But it may 
also be that when asked about the degree of responsibility of a bystander, people are actually 
responding with something like the degree of blame. As we have already seen, under some 
reasonable assumptions, the degree of blame goes down as n increases. 

Thinking in terms of blame also provides an approach to dealing with the apparent atten- 
uation of responsibility along causal chains, namely, the claim is that what is attenuating is 
really blame (in the technical sense that I have defined it in Section 6.2), not responsibility. 
As I pointed out in the discussion in Example 6.2.7, for all variants of the HP definition, the 
degree of blame attenuates as we go along the chain. 

This discussion suggests that we may want to combine considerations of normality and 
blame. The alternative approach to incorporating normality given in Section 3.5 gives us a 
useful framework for doing so, especially when combined with some of the ideas in Chapter 5. 
In the alternative approach to normality, I assume a partial preorder on contexts that is viewed 
as a normality ordering. For blame, I assumed a probability on contexts in this section, but 
that was mainly so that we could compute blame as the expected degree of responsibility. 
I discussed in Sections 5.2.1 and 5.5.1 how a partial preorder on worlds can be viewed as 
an algebraic conditional plausibility measure. Recall that an algebraic plausibility measure is just 
a generalization of probability: it is a function that associates with each event (set of worlds) 
a plausibility, and it has operations ⊕ and ⊗ that are analogues of addition and multiplication, 
so that plausibilities can be added and multiplied. If we also take the degree of responsibility 
to be an element of the domain of the plausibility measure, we can still take blame to be 
the expected degree of responsibility, but now it can be a more qualitative plausibility, rather 
than necessarily being a number. This is perhaps more consistent with the more qualitative 
way people use and assign blame. In any case, this viewpoint lets us work with blame and 
normality in one framework; instead of using probability or a partial preorder on worlds, we 
use a plausibility measure (or perhaps two plausibility measures: one representing likelihood 
and the other representing normality in the sense of norms). Doing this also lets us incorporate 
normality considerations into blame in a natural way and lets us work with likelihood without 
necessarily requiring a probabilistic representation of it. It is not yet clear how much is gained 
by this added generalization, although it seems worth exploring. 


Notes 


The definitions of responsibility and blame given in Sections 6.1 and 6.2 are largely taken 
from [Chockler and Halpern 2004], as is much of the discussion in these sections. However, 
there is a subtle difference between the definition of responsibility given here for the original 
and updated definition and the one given by Chockler and Halpern. In computing the degree 
of responsibility of $\vec{X} = \vec{x}$ for $\varphi$, the Chockler-Halpern definition counted the number of 
changes needed to make $\vec{X} = \vec{x}$ critical. Thus, for example, in Example 6.1.4, it would 
have taken the degree of responsibility of A = 0 for C = 1 to be 1, since the value of B 
did not have to change to make A = 0 critical. Similarly, it would take the degree of responsibility 
of Suzy’s throw in the sophisticated rock-throwing model to be 1. The current definition seems to be a more accurate 
representation of people’s responsibility ascriptions. 


The definition of blame given here is also slightly modified from the original Chockler- 
Halpern definition. Here I assume that all the contexts $\vec{u}$ in the set $\mathcal{K}$ used to assign a degree 
of blame to $\vec{X} = \vec{x}$ also satisfy $\vec{X} = \vec{x}$. In [Chockler and Halpern 2004], it was assumed 
that the contexts in $\mathcal{K}$ represented the agent’s epistemic state prior to observing $\vec{X} = \vec{x}$, and 
the degree of responsibility of $\vec{X} = \vec{x}$ for $\varphi$ was computed with respect to $(M_{\vec{X} \leftarrow \vec{x}}, \vec{u})$, not 
$(M, \vec{u})$. The choice here is arguably more reasonable, given that my focus has been on 
assigning degree of blame and responsibility to $\vec{X} = \vec{x}$ in circumstances where $\vec{X} = \vec{x}$ holds. 
Note that the agent might well assign different probabilities to settings if $\vec{X} = \vec{x}$ is observed 
than if $\vec{X} = \vec{x}$ is the result of an intervention setting $\vec{X}$ to $\vec{x}$. 


There is an analysis of the complexity of computing the degree of responsibility and blame 
for the original HP definition by Chockler and Halpern [2004], for the updated definition 
by Aleksandrowicz et al. [2014], and for the modified definition by Alechina, Halpern, and 
Logan [2016]. 


Tim Williamson [private communication, 2002] suggested the marksmen example to dis- 
tinguish responsibility from blame. The law apparently would hold the marksmen who fired 
the live bullet more accountable than the ones who fired blanks. To me, this just says that the 
law is taking into account both blame and responsibility when it comes to punishment. I think 
there are advantages to carefully distinguishing what I have called “blame” and “responsibil- 
ity” here, so that we can then have a sensible discussion of how punishment should depend on 
each. 


Zimmerman [1988] gives a good introduction to the literature on moral responsibility. 
Somewhat surprisingly, most of the discussion of moral responsibility in the literature does 
not directly relate moral responsibility to causality. Shafer [2001] does discuss a notion of 
responsibility that seems somewhat in the spirit of the notion of blame as defined here, espe- 
cially in that he views responsibility as being based (in part) on causality. However, he does 
not give a formal definition of responsibility, so it is hard to compare his notion to the one 
used here. Moreover, there are some significant technical differences between his notion of 
causality and the HP definition, so a formalization of Shafer’s notion would undoubtedly be 
different from the notions used here. 


Hart and Honoré [1985, p. 482] discuss how different epistemic states are relevant for 
determining responsibility. 

Goldman [1999] seems to have been the first to explain voting in terms of (causal) respon- 
sibility. His analysis was carried out in terms of Mackie’s INUS condition (see the notes to 
Chapter 1), but essentially the same analysis applies here. He also raised the point that some- 
one who abstains should have a smaller degree of responsibility for an outcome than someone 
who votes for it, but did not show formally that this was the case in his model. Riker and 
Ordeshook [1968] estimated the probability of a voter being decisive in a U.S. presidential 
election as $10^{-8}$. 


Gerstenberg and Lagnado [2010] showed that the notion of responsibility defined here 
does a reasonable job in characterizing people’s causality ascriptions. Zultan, Gerstenberg, 
and Lagnado [2012] discuss the effect of multiple witnesses on responsibility ascriptions and 
give Example 3.5.1 as an instance of this phenomenon. They also provide a formal definition 
of responsibility that takes this into account. Given $\vec{X} = \vec{x}$, $\varphi$, and a causal setting $(M, \vec{u})$, let 

$$N = \frac{1}{\sum_{i=1}^{k} \frac{1}{n_i}},$$

where k is the number of witnesses for $\vec{X} = \vec{x}$ being a cause of $\varphi$, and for each witness 
$(\vec{W}_i, \vec{w}_i, \vec{x}'_i)$, $n_i = |\vec{W}_i|$, for i = 1, ..., k. Then the degree of responsibility is taken to be 
1/(N + 1). In the special case where there is only one witness, this definition reduces to 
Definition 6.1.1. Zultan, Gerstenberg, and Lagnado show this more refined notion of responsibility 
is a better predictor of people’s ratings than the one given by Definition 6.1.1. 


The voting experiments discussed in Section 6.3 are described in [Gerstenberg, Halpern, 
and Tenenbaum 2015]. There it is pointed out that, in addition to pivotality and normality, 
what is called criticality is significant. The notion of criticality used is discussed in more 
detail by Lagnado, Gerstenberg, and Zultan [2013]. Criticality is meant to take people’s 
prior beliefs into account. Taking Pr to be the prior probability on causal settings, Lagnado, 
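Gerstenberg, and Zultan characterize the criticality of $X = x$ for $\varphi$ by the formula 

$$\mathrm{Crit}(\varphi, X = x) = 1 - \frac{\Pr(\varphi \mid X \neq x)}{\Pr(\varphi \mid X = x)}.$$

$X = x$ is assumed to have a positive impact on $\varphi$, so $\Pr(\varphi \mid X = x) > \Pr(\varphi \mid X \neq x)$. If 
$X = x$ is irrelevant to $\varphi$, then $\Pr(\varphi \mid X = x) = \Pr(\varphi \mid X \neq x)$, so $\mathrm{Crit}(\varphi, X = x) = 0$. 
However, if $X = x$ is necessary for $\varphi$, then $\Pr(\varphi \mid X \neq x) = 0$, so $\mathrm{Crit}(\varphi, X = x) = 1$. 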
Lagnado, Gerstenberg, and Zultan then suggest that the ascription of responsibility is some 
function of criticality and pivotality. 
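
The two boundary cases of criticality, defined as one minus the ratio of Pr(φ | X ≠ x) to Pr(φ | X = x), can be checked with a one-line Python function. This is a minimal sketch; the function name `criticality` and its two arguments, which stand for the two conditional probabilities, are hypothetical.

```python
def criticality(p_phi_given_x: float, p_phi_given_not_x: float) -> float:
    """Criticality of X = x for phi, computed from Pr(phi | X = x) and
    Pr(phi | X != x)."""
    if p_phi_given_x <= 0:
        raise ValueError("Pr(phi | X = x) must be positive")
    return 1 - p_phi_given_not_x / p_phi_given_x


print(criticality(0.8, 0.8))  # X = x is irrelevant to phi
print(criticality(0.8, 0.0))  # X = x is necessary for phi
print(criticality(0.8, 0.4))  # an intermediate case
```

Only the ratio of the two conditional probabilities matters, so criticality says nothing about whether X = x is actually a cause of φ in any particular causal setting.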


I have not discussed criticality here because I view it as playing essentially the same role 
as blame. Note that if $X = x$ is irrelevant to $\varphi$, then the degree of blame of $X = x$ for $\varphi$ is 
0; if $X = x$ is necessary for $\varphi$ in the sense of being a but-for cause of $\varphi$ whenever $\varphi$ occurs, 
and $\varphi$ is known to have occurred (i.e., $\varphi$ holds in all causal settings in $\mathcal{K}$), then the degree of 
blame of $X = x$ for $\varphi$ is 1. Although the degree of blame and degree of criticality are related, 
and both take prior probabilities into account, they are not the same. For example, if $X = x$ 
and $\varphi$ are perfectly correlated, but $X = x$ is not a cause of $\varphi$ (which would be the case if 
$X = x$ and $\varphi$ have a common cause), then the degree of blame of $X = x$ for $\varphi$ would be 0, 
whereas $\mathrm{Crit}(\varphi, X = x)$ would still be 1. It seems to me that degree of blame comes closer 
to describing the impact of the prior probability on responsibility ascriptions than criticality 
in cases like this. But there is no question that, because people seem to conflate blame and 
responsibility, the prior probability will have an impact. 

Further support for the role of normality in responsibility judgments is provided by the 
work of Kominsky et al. [2014] (although they present a somewhat different account of their 
experimental results than that given here). 

See Chu and Halpern [2008] for details on applying plausibility measures in the context of 
expectation. 


Explanation


Good explanations are like bathing suits, darling; they are meant to reveal 
everything by covering only what is necessary. 


E. L. Konigsburg


There has to be a mathematical explanation for how bad that tie is.


Russell Crowe


Perhaps the main reason that people are interested in causality is that they want explanations. 
Consider the three questions I asked in the first paragraph of this book: Why is my friend 
depressed? Why won’t that file display properly on my computer? Why are the bees suddenly 
dying? The questions are asking for causes; the answers would provide an explanation. 

Perhaps not surprisingly, just as with causality, getting a good definition of explanation 
is notoriously difficult. And, just as with causality, issues concerning explanation have been 
the focus of philosophical investigation for millennia. In this chapter, I show how the ideas 
behind the HP definition of causality can be used to give a definition of (causal) explanation 
that deals well with many of the problematic examples discussed in the literature. The basic 
idea is that an explanation is a fact that, if found to be true, would constitute an actual cause 
of the explanandum (the fact to be explained), regardless of the agent’s initial uncertainty. 

As this gloss suggests, the definition of explanation involves both causality and knowledge. 
What counts as an explanation for one agent may not count as an explanation for another 
agent, since the two agents may have different epistemic states. For example, an agent seeking 
an explanation of why Mr. Johansson has been taken ill with lung cancer will not consider the 
fact that he worked for years in asbestos manufacturing a part of an explanation if he already 
knew this fact. For such an agent, an explanation of Mr. Johansson’s illness may include a 
causal model describing the connection between asbestos fibers and lung cancer. However, for 
someone who already knows the causal model but does not know that Mr. Johansson worked 
in asbestos manufacturing, the explanation would involve Mr. Johansson’s employment but 
would not mention the causal model. This example illustrates another important point: an 
explanation may include (fragments of) a causal model. 


Salmon distinguishes between epistemic and ontic explanations. Roughly speaking, an
epistemic explanation is one that depends on an agent’s epistemic state, telling him something 
that he doesn’t already know, whereas an ontic explanation is agent-independent. An ontic 
explanation would involve the causal model and all the relevant facts. When an agent asks for 
an explanation, he is typically looking for an epistemic explanation relative to his epistemic 
state; that is, those aspects of the ontic explanation that he does not already know. Both 
notions of explanation seem to me to be interesting. Moreover, having a good definition of 
one should take us a long way toward getting a good definition of the other. The definitions I 
give here are more in the spirit of the epistemic notion. 


7.1 Explanation: The Basic Definition 


The “classical” approaches to defining explanation in the philosophy literature, such as 
Hempel’s deductive-nomological model and Salmon’s statistical relevance model, fail to ex- 
hibit the directionality inherent in common explanations. Despite all the examples in the 
philosophy literature on the need for taking causality and counterfactuals into account, and 
the extensive work on causality defined in terms of counterfactuals in the philosophy litera- 
ture, philosophers have been reluctant to build a theory of explanation on top of a theory of 
causality built on counterfactuals. (See the notes at the end of the chapter.) 

As I suggested above, the definition of explanation is relative to an epistemic state, just 
like that of blame. An epistemic state K, as defined in Section 6.2, is a set of causal settings,
with a probability distribution over them. I assume for simplicity in the basic definition that
the causal model is known, so that we can view an epistemic state as a set of contexts. The
probability distribution plays no role in the basic definition, although it will play a role in the
next section, when I talk about the “quality” or “goodness” of an explanation. Thus, for the
purposes of the following definition, I take an epistemic state to simply be a set K of contexts.
I think of K as the set of contexts that the agent considers possible before observing φ, the
explanandum. Given a formula ψ, let K_ψ = {u⃗ ∈ K : (M, u⃗) ⊨ ψ}. Recall that in
Section 3.5, I defined [[ψ]] = {u⃗ : (M, u⃗) ⊨ ψ}. Using this notation (which will be useful
later in this chapter), K_ψ = K ∩ [[ψ]].

The definition of explanation is built on the definition of sufficient causality, as defined 
in Section 2.6. Just as there are three variants of the definition of causality, there are three 
variants of the definition of explanation. 


Definition 7.1.1 X⃗ = x⃗ is an explanation of φ relative to a set K of contexts in a causal
model M if the following conditions hold:


EX1. X⃗ = x⃗ is a sufficient cause of φ in all contexts in K satisfying X⃗ = x⃗ ∧ φ. More
precisely,


• If u⃗ ∈ K and (M, u⃗) ⊨ X⃗ = x⃗ ∧ φ, then there exists a conjunct X = x of X⃗ = x⃗
and a (possibly empty) conjunction Y⃗ = y⃗ such that X = x ∧ Y⃗ = y⃗ is a cause
of φ in (M, u⃗). (This is essentially condition SC2 from the definition of sufficient
cause, applied to all contexts u⃗ ∈ K_{X⃗=x⃗∧φ}. Note that SC1 is guaranteed to hold
in all contexts in K_{X⃗=x⃗∧φ}.)


• (M, u⃗′) ⊨ [X⃗ ← x⃗]φ for all contexts u⃗′ ∈ K. (This is just the sufficiency
condition SC3, restricted to the contexts in K.)


EX2. X⃗ is minimal; there is no strict subset X⃗′ of X⃗ such that X⃗′ = x⃗′ satisfies EX1, where
x⃗′ is the restriction of x⃗ to the variables in X⃗′. (This is just SC4.)


EX3. K_{X⃗=x⃗∧φ} ≠ ∅. (This just says that the agent considers possible a context where the
explanation holds.) ■


The explanation is nontrivial if it satisfies in addition


EX4. (M, u⃗′) ⊨ ¬(X⃗ = x⃗) for some u⃗′ ∈ K_φ. (The explanation is not already known given
the observation of φ.) ■


Most of the conditions here depend only on K_φ, the set of contexts that the agent considers
possible after discovering φ. The set K plays a role only in the second clause of EX1 (the
analogue of SC3); it determines the set of contexts for which the condition [X⃗ ← x⃗]φ must
hold.

The minimality requirement EX2 essentially throws out parts of the explanation that are 
already known. Thus, we do not get an ontic explanation. The nontriviality requirement 
EX4 (which I view as optional) says that explanations that are known do not count as “real” 
explanations—the minimality requirement as written does not suffice to get rid of trivial ex- 
planations. This seems reasonable for epistemic explanations. 

EX4 may seem incompatible with linguistic usage. For example, say that someone observes
φ, then discovers some fact A and says, “Aha! That explains why φ happened.” That
seems like a perfectly reasonable utterance, yet A is known to the agent when he says it. A
is a trivial explanation of φ (one not satisfying EX4) relative to the epistemic state after A
has been discovered. I think of K_φ as the agent’s epistemic state just after φ is discovered
but before A is discovered (although nothing in the formal definition requires this). A may
well be a nontrivial explanation of φ relative to the epistemic state before A was discovered.
Moreover, even after an agent discovers A, she may still be uncertain about how A caused
φ; that is, she may be uncertain about the causal model. This means that A is not, in fact, a
trivial explanation once we take all of the agent’s uncertainty into account, as the more general
definition of explanation will do.


Example 7.1.2 Consider again the forest-fire example (Example 2.3.1). In the notation
introduced in Section 2.1, consider the following four contexts: in u_0 = (0, 0), there is no
lightning and no arsonist; in u_1 = (1, 0), there is only lightning; in u_2 = (0, 1), the arsonist
drops a match but there is no lightning; and in u_3 = (1, 1), there is lightning and the arsonist
drops a match. In the disjunctive model M^d, if K_1 = {u_0, u_1, u_2, u_3}, then both L = 1
and MD = 1 are explanations of FF = 1 relative to K_1, according to all variants of the HP
definition.

Going on with the example, L = 1 and MD = 1 are also both explanations of FF = 1
relative to K_2 = {u_0, u_1, u_2} and K_3 = {u_0, u_1, u_3}. However, in the case of K_3, L = 1 is a
trivial explanation; it is known once the forest fire is discovered.

By way of contrast, in the conjunctive forest-fire model M^c, relative to K_1, the only
explanation of the fire is L = 1 ∧ MD = 1. Due to the sufficiency requirement, with u_0, u_1,
and u_2 in K_1, neither L = 1 nor MD = 1 is by itself an explanation of the fire. However,
L = 1 ∧ MD = 1 is a trivial explanation. If the agent is convinced that the only way that the
forest fire could start is through a combination of lightning and a dropped match (it could not,
for example, have been due to unattended campfires), then once the forest fire is observed, he
knows why it happened; he does not really need an explanation.

However, with respect to K_4 = {u_1, u_3}, MD = 1 is an explanation of the forest fire,
although a trivial one. Since L = 1 is already known, MD = 1 is all the agent needs to
know to explain the fire, but since both MD = 1 and L = 1 are required for there to be a
fire, the agent knows that MD = 1 when he sees the fire. Note that MD = 1 ∧ L = 1 is not
an explanation; it violates the minimality condition EX2. L = 1 is not an explanation either,
since (M^c, u_1) ⊨ ¬[L ← 1](FF = 1), so sufficient causality does not hold.
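
The parts of Definition 7.1.1 that do not need a causality checker can be verified mechanically for this example. The sketch below is only an illustration under stated assumptions: it checks the sufficiency clause of EX1, EX3, and the failure of EX4, and it does not implement the actual-cause clause of EX1 or the minimality condition EX2. The helper names (`solve`, `sufficient`, `k_fire`, `possible`, `trivial`) and the model labels "disjunctive" and "conjunctive" are hypothetical.

```python
# Contexts are pairs (lightning, match): u0 = (0, 0), u1 = (1, 0), u2 = (0, 1), u3 = (1, 1).
U0, U1, U2, U3 = (0, 0), (1, 0), (0, 1), (1, 1)


def solve(model, context, do=None):
    """Values of L, MD, FF in `context` after the intervention `do` on L and/or MD."""
    values = {"L": context[0], "MD": context[1]}
    values.update(do or {})
    if model == "disjunctive":
        values["FF"] = max(values["L"], values["MD"])
    else:  # "conjunctive"
        values["FF"] = min(values["L"], values["MD"])
    return values


def sufficient(model, K, x):
    """Second clause of EX1: [X <- x](FF = 1) holds in every context of K."""
    return all(solve(model, u, x)["FF"] == 1 for u in K)


def k_fire(model, K):
    """K_phi for phi = (FF = 1): the contexts of K in which there is a fire."""
    return [u for u in K if solve(model, u)["FF"] == 1]


def holds(model, u, x):
    """True if the conjunction x (a dict of variable values) holds in context u."""
    values = solve(model, u)
    return all(values[var] == val for var, val in x.items())


def possible(model, K, x):
    """EX3: some context of K satisfies both X = x and FF = 1."""
    return any(holds(model, u, x) for u in k_fire(model, K))


def trivial(model, K, x):
    """EX4 fails: X = x already holds in every context of K_phi."""
    return all(holds(model, u, x) for u in k_fire(model, K))


K1, K3, K4 = [U0, U1, U2, U3], [U0, U1, U3], [U1, U3]
print(sufficient("disjunctive", K1, {"L": 1}), trivial("disjunctive", K3, {"L": 1}))
print(sufficient("conjunctive", K1, {"L": 1}), sufficient("conjunctive", K4, {"MD": 1}))
```

Under these assumptions, the checks agree with the example: lightning alone fails the sufficiency clause in the conjunctive model, and the explanations described as trivial are exactly those that hold in every context where there is a fire.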


This example already suggests why we want to consider sufficient causality in the context
of explanation. We would not typically accept lightning as an explanation of the forest fire
relative to either K_1 or K_4.


Example 7.1.3 Now consider the voting scenario with 11 voters discussed in Section 2.1.
Let K_1 consist of all the 2^11 contexts describing the possible voting patterns. If we want an
explanation of why Suzy won relative to K_1, then any subset of six voters voting for Suzy
is one. Thus, in this case, the explanations look like the causes according to the modified
HP definition. But this is not the case in general, as shown by the discussion above for the
disjunctive model of the forest fire. If K_2 consists of all the 2^6 contexts where voters 1 to
5 vote for Suzy, then each of V_i = 1 is an explanation of Suzy’s victory, for i = 6, ..., 11.
Intuitively, if we already know that the first five voters voted for Suzy, then a good explanation
tells us who gave her the sixth vote. ■
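
The sufficiency part of this example is easy to check by enumeration. The sketch below is an illustration only: it tests the second clause of EX1 and a minimality check restricted to that clause, not the actual-cause clause. The names `suzy_wins`, `sufficient`, and `minimal_sufficient` are hypothetical, and voters are numbered 1 to 11 as in the text.

```python
from itertools import combinations, product

N = 11  # voters; Suzy wins with at least 6 votes


def suzy_wins(votes):
    return sum(votes) >= 6


def sufficient(K, fixed):
    """[V_i <- 1 for each i in fixed](Suzy wins) holds in every context of K."""
    return all(
        suzy_wins([1 if i + 1 in fixed else v for i, v in enumerate(u)]) for u in K
    )


def minimal_sufficient(K, fixed):
    """Sufficient, and no strict subset of `fixed` is sufficient (sufficiency clause only)."""
    return sufficient(K, fixed) and not any(
        sufficient(K, set(sub)) for r in range(len(fixed)) for sub in combinations(fixed, r)
    )


K1 = list(product((0, 1), repeat=N))                      # all 2^11 voting patterns
K2 = [u for u in K1 if all(u[i] == 1 for i in range(5))]  # voters 1 to 5 vote for Suzy

six_sets = [set(c) for c in combinations(range(1, N + 1), 6)]
print(len(K1), len(K2), len(six_sets))
print(minimal_sufficient(K1, {1, 2, 3, 4, 5, 6}), minimal_sufficient(K2, {6}))
```

Relative to K_1 a set of six votes is needed to guarantee the victory, whereas relative to K_2 a single additional vote already does so, which matches the intuition about the sixth vote.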


Although I have used the word “explanation”, the explanations here should be thought 
of as potential or possible explanations. For example, when considering the explanation for 
Suzy’s victory relative to K2, where it is known that voters 1 to 5 voted for Suzy, calling voter 
6 an explanation doesn’t mean that, in fact, voter 6 voted for Suzy. Rather, it means that it is 
possible that voter 6 voted for Suzy and, if she did, then her vote was sufficient to give Suzy 
the victory. 

As with causes, Definition 7.1.1 disallows disjunctive explanations. However, disjunctive 
explanations seem to make perfect sense here, particularly since we are thinking in terms of 
possible explanations. It seems quite reasonable to say that Suzy’s victory is explained by the 
fact that either voter 6 or voter 7 or ... voter 11 voted for her; we need further investigation 
to determine which voter it actually was. In this case, we can give a reasonable definition
of disjunctive explanations; say that X⃗_1 = x⃗_1 ∨ ... ∨ X⃗_k = x⃗_k is an explanation of φ if each
of X⃗_i = x⃗_i is an explanation of φ, according to Definition 7.1.1. Disjunctive explanations
defined in this way play a role when it comes to talking about the quality of an explanation, a 
point I return to in the next section. I conclude this section with one more example. 


Example 7.1.4 Suppose that there was a heavy rain in April and electrical storms in the
following two months; and in June there was a forest fire. If it hadn’t been for the heavy rain
in April, given the electrical storm in May, the forest would have caught fire in May (and not in
June). However, given the April rain, if there had been an electrical storm only in May, the forest
would not have caught fire at all; if there had been an electrical storm only in June, it would
have caught fire in June. The situation can be captured using a model with five endogenous
binary variables:


• AS for “April showers”, where AS = 0 if it did not rain heavily in April and 1 if it did;


• ES_M for “electrical storm in May” and ES_J for “electrical storm in June” (with value
0 if there is no storm in the appropriate month, and 1 if there is);


• and FF_M for “fire in May” and FF_J for “fire in June” (with value 0 if there is no fire in
the appropriate month, and 1 if there is).


The equations are straightforward: the values of AS, ES_M, and ES_J are determined by the
context; FF_M = 1 exactly if ES_M = 1 and AS = 0; FF_J = 1 exactly if ES_J = 1 and
either AS = 1 or ES_M = 0 (or both).

Identifying a context with a triple (i, j, k) consisting of the values of AS, ES_M, and ES_J,
respectively, let K_0 consist of all eight possible contexts. What are the explanations of the
forest fire (FF_M = 1 ∨ FF_J = 1) relative to K_0? It is easy to see that the only sufficient causes
of fire are ES_J = 1 and ES_M = 1 ∧ AS = 0, and these are in fact the only explanations of the
fire relative to K_0. If we are instead interested in an explanation of a fire in June (FF_J = 1),
then the only explanations are AS = 1 ∧ ES_J = 1 and ES_M = 0 ∧ ES_J = 1. All the variants
of the HP definition agree in these cases.

This seems to me quite reasonable. The fact that there was a storm in June does not 
explain the fire in June, but the storm in June combined with either the fact that there 
was no storm in May or the fact that there were April showers does provide an explanation.
It also seems reasonable to consider the disjunctive explanation
(ES_J = 1 ∧ AS = 1) ∨ (ES_J = 1 ∧ ES_M = 0).

Now suppose that the agent already knows that there was a forest fire, so is seeking an
explanation of the fire with respect to K_1, the set consisting of the five contexts compatible with
a fire, namely, (0, 1, 0), (0, 0, 1), (0, 1, 1), (1, 0, 1), and (1, 1, 1). In this case, the situation is
a bit more complicated. The sufficient causality clause holds almost trivially, since there is a 
forest fire in all contexts under consideration. Here is a summary of the situation: 


• AS = 1 is not an explanation of the fire according to any variant of the HP definition,
since it is not part of a cause in contexts (1, 0, 1) or (1, 1, 1).


• AS = 0 is not an explanation of the fire according to the updated and modified HP
definitions, since it is not a cause of fire (FF_M = 1 ∨ FF_J = 1) in the context (0, 0, 1).
However, it is a cause of the fire in this context according to the original HP definition
(consider the contingency ES_M = 1 ∧ ES_J = 0). (Arguably, the original HP definition
is giving an inappropriate answer in this case.) It easily follows that AS = 0 is an
explanation of FF_M = 1 ∨ FF_J = 1 relative to K_1 according to the original HP
definition.


• ES_J = 1 is an explanation of the forest fire relative to K_1 according to all the
variants of the HP definition. It is not hard to check that ES_J = 1 is a cause of
FF_M = 1 ∨ FF_J = 1 in (M, u⃗) for all contexts u⃗ ∈ K_1 such that (M, u⃗) ⊨ ES_J = 1
according to the original and updated HP definitions, and is part of a cause in all these
contexts according to the modified HP definition.


• ES_M = 1 is not an explanation of FF_M = 1 ∨ FF_J = 1 in M relative to K_1 according to
the updated and modified HP definitions but is an explanation according to the original
HP definition. Consider the context (1, 1, 1). I leave it to the reader to check that
taking W⃗ = (AS, ES_J) and setting W⃗ = 0⃗ shows that the May storm is a cause of
the fire according to the original HP definition, but this is not a witness in the case
of the updated HP definition. (Again, the original HP definition is arguably giving an
inappropriate answer in this case.)


• ES_M = 1 ∧ AS = 0 is an explanation of FF_M = 1 ∨ FF_J = 1 in M relative to K_1,
according to the updated and modified HP definitions; a conjunct of ES_M = 1 ∧ AS = 0,
namely ES_M = 1, is part of a cause of FF_M = 1 ∨ FF_J = 1 in every context satisfying
ES_M = 1 ∧ AS = 0. Moreover, the minimality condition EX2 is easily seen to hold
(since neither ES_M = 1 nor AS = 0 is an explanation, as we have seen).


Now suppose that we are looking for an explanation of the June fire relative to K_1.


• As before, it is easy to check that ES_J = 1 is not an explanation of FF_J = 1 because
it is not a sufficient cause of FF_J = 1: it violates the second clause of EX1, since in
the context (0, 1, 0), [ES_J ← 1](FF_J = 1) does not hold.


• AS = 1 is not an explanation because it is not a sufficient cause (again, the second
clause of EX1 is violated in the context (0, 1, 0)).


• ES_M = 1 is not an explanation because it is not a sufficient cause (yet again, the second
clause of EX1 is violated in the context (0, 1, 0)).


• ES_M = 0 is not an explanation because it is not a sufficient cause (yet again, consider
the context (0, 1, 0)).


• ES_M = 0 ∧ ES_J = 1 and AS = 1 ∧ ES_J = 1 are both explanations, according to all
variants of the HP definition.


Thus, although we have different explanations of fire relative to K_0 and K_1, the explanations
of the June fire are the same. ■
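
The sufficiency claims in Example 7.1.4 can be reproduced by enumerating the eight contexts. This is a sketch under stated assumptions: it implements only the equations given in the example and the second clause of EX1, not the HP definitions of causality. The names `solve`, `sufficient`, and `violators`, and the dictionary keys used for the variables, are hypothetical.

```python
from itertools import product


def solve(context, do=None):
    """Variable values in context (AS, ES_M, ES_J) after intervening on those three variables."""
    as_, es_m, es_j = context
    v = {"AS": as_, "ES_M": es_m, "ES_J": es_j}
    v.update(do or {})
    v["FF_M"] = int(v["ES_M"] == 1 and v["AS"] == 0)
    v["FF_J"] = int(v["ES_J"] == 1 and (v["AS"] == 1 or v["ES_M"] == 0))
    return v


def fire(v):
    return v["FF_M"] == 1 or v["FF_J"] == 1


def june_fire(v):
    return v["FF_J"] == 1


K0 = list(product((0, 1), repeat=3))        # all eight contexts
K1 = [u for u in K0 if fire(solve(u))]      # contexts compatible with a fire


def violators(K, x, phi):
    """Contexts of K in which [X <- x]phi fails (second clause of EX1)."""
    return [u for u in K if not phi(solve(u, x))]


def sufficient(K, x, phi):
    return not violators(K, x, phi)


print(K1)
print(violators(K1, {"ES_J": 1}, june_fire))
print(sufficient(K1, {"AS": 1, "ES_J": 1}, june_fire))
```

The enumeration confirms that K_1 contains exactly the five contexts listed above and that (0, 1, 0) is a witness against the sufficiency of each single-variable candidate for the June fire.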


7.2 Partial Explanations and Explanatory Power 


Not all explanations are considered equally good. There are different dimensions of “good- 
ness”, such as simplicity, generality, and informativeness. I focus on one aspect of “goodness” 
here: likelihood (without meaning to suggest that the other aspects are unimportant!). To cap- 
ture the likelihood of an explanation, it is useful to bring probability into the picture. Suppose 
that the agent has a probability Pr on the set K of possible contexts. Recall that I think of K 
as the set of contexts that the agent considers possible before observing the explanandum φ;
Pr can be thought of as the agent’s prior probability on K. One obvious way of defining the
goodness of X⃗ = x⃗ as an explanation of φ relative to K is to consider the probability of the
set of contexts where X⃗ = x⃗ is true or, as has been more commonly done, to consider this
probability conditional on φ.


This is certainly a useful notion of goodness. For example, going back to the forest-fire
example and taking the set of contexts to be K_2 = {u_0, u_1, u_2}, the agent may consider lightning
more likely than an arsonist dropping a match. In this case, intuitively, we should consider
L = 1 a better explanation of the fire than MD = 1; it has a higher probability of being
the actual explanation than MD = 1. (What I have been calling an explanation of φ should
really be viewed as a possible explanation of φ; this definition of goodness provides one way
of evaluating the relative merits of a number of possible explanations of φ.) Of course, if we
allow disjunctive explanations, then L = 1 ∨ MD = 1 is an even better explanation from
this viewpoint; it has probability 1 conditional on observing the fire. But this already points
to an issue. Intuitively, L = 1 ∨ MD = 1 is a less informative explanation. I would like to
capture this intuition of “informativeness”, but before doing so, it is useful to take a detour
and consider partial explanations. For most of the discussion that follows, I do not consider
disjunctive explanations; I return to them at the end of the section.
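
As a small numeric illustration of this likelihood-based notion of goodness, the sketch below computes the probability of each candidate explanation conditional on the fire, relative to the three contexts u_0, u_1, u_2 of the disjunctive forest-fire model. The prior probabilities are hypothetical values chosen only so that lightning is more likely than a dropped match; the helper name `pr_given_fire` is also an assumption.

```python
from fractions import Fraction

# Contexts (lightning, match): u0 = (0, 0), u1 = (1, 0), u2 = (0, 1), with a hypothetical prior.
prior = {(0, 0): Fraction(6, 10), (1, 0): Fraction(3, 10), (0, 1): Fraction(1, 10)}


def fire(u):
    return u[0] == 1 or u[1] == 1  # disjunctive model: FF = 1 if L = 1 or MD = 1


def pr_given_fire(event):
    """Probability of the contexts satisfying `event`, conditional on FF = 1."""
    mass = sum(p for u, p in prior.items() if fire(u))
    return sum(p for u, p in prior.items() if fire(u) and event(u)) / mass


lightning = pr_given_fire(lambda u: u[0] == 1)
match = pr_given_fire(lambda u: u[1] == 1)
either = pr_given_fire(lambda u: u[0] == 1 or u[1] == 1)
print(lightning, match, either)
```

With these numbers, lightning scores higher than the dropped match, and the disjunction scores highest of all even though it is the least informative of the three.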


Example 7.2.1 Suppose that I see that Victoria is tanned and I seek an explanation. The 
causal model includes the three variables “Victoria took a vacation in the Canary Islands”, 
“sunny in the Canary Islands”, and “went to a tanning salon”; a context assigns values to
just these three variables. That means that there are eight contexts, depending on the values 
assigned to these three variables. Before seeing Victoria, I consider all eight of them possible. 
Victoria going to the Canaries is not an explanation of Victoria’s tan, for two reasons. First, 
Victoria going to the Canaries is not a sufficient cause of Victoria getting a tan. There are con- 
texts where it is not sunny in the Canaries and Victoria does not go to the tanning salon; going 
to the Canaries in those contexts would not result in her being tanned. And even among con- 
texts where Victoria is tanned and goes to the Canaries, going to the Canaries is not the cause 
of her getting the tan in the context where it is not sunny and she goes to the tanning salon, 
at least according to the modified and updated HP definitions. (According to the original HP 
definition, it is a cause; this example has the same structure as Example 2.8.1, where A does 
not load B’s gun, although B shoots. And just as with that example, the original HP definition seems to
be giving an inappropriate causality ascription here. The problem can be resolved in the same 
way as Example 2.8.1, by adding an extra variable for “Victoria vacationed in a place that was 
sunny and warm”; then going to the Canaries is not a cause of Victoria’s tan even according to 
the original HP definition.) Nevertheless, most people would accept “Victoria took a vacation 
in the Canary Islands” as a satisfactory explanation of Victoria being tanned. Although EX1 
is not satisfied, intuitively, it is “almost” satisfied, especially if we assume that the Canaries 
are likely to be sunny. The problematic contexts are exactly the ones where it is not sunny in 
the Canaries, and these have low probability. The only complete (ontic) explanations accord- 
ing to Definition 7.1.1 are “Victoria went to the Canary Islands and it was sunny”, “Victoria 
went to the Canary Islands and did not go to a tanning salon”, and “Victoria went to a tanning 
salon”. “Victoria went to the Canary Islands” is a partial explanation (a notion more general 
than just being part of an explanation). ■

In Example 7.2.1, the partial explanation can be extended to a complete explanation by
adding a conjunct. But not every partial explanation as I am about to define it can be extended 
to a complete explanation. Roughly speaking, this is because the complete explanation may 
involve exogenous factors, which are not permitted in explanations. Suppose, for example, 
that we used a different causal model for Victoria, one where the only endogenous variable is 
the one representing Victoria’s vacation; there is no variable corresponding to the weather in 
the Canaries, nor one corresponding to whether Victoria goes to a tanning salon. Instead, the 
model simply has exogenous variables that can make Victoria suntanned even in the absence 
of a trip to the Canaries and leave her not suntanned even if she goes to the Canaries. Arguably 
this model is insufficiently expressive, since it does not capture important features of the story. 
But even in this model, Victoria’s vacation would still be a partial explanation of her suntan: 
the contexts where it fails to be a (sufficient) cause (ones corresponding to no sun in the 
Canary Islands) are fairly unlikely. But note that, in this model, we cannot add conjuncts to 
the event of Victoria going to the Canaries that exclude the “bad” contexts from consideration. 
Indeed, in this model, there is no (complete) explanation for Victoria’s tan. Given the choice 
of exogenous variables, it is inexplicable! 

Intuitively, if we do not have a “name” for a potential cause, then there may not be an 
explanation for an observation. This type of situation arises often, as the following example 
shows. 


Example 7.2.2 Suppose that the sound on a television works but there is no picture. Fur- 
thermore, the only cause of there being no picture that the agent is aware of is the picture 
tube being faulty. (Older non-digital televisions had picture tubes.) However, the agent is 
also aware that there are times when there is no picture even though the picture tube works 
perfectly well—intuitively, there is no picture “for inexplicable reasons”. This is captured 
by the causal network described in Figure 7.1, where T describes whether the picture tube is 
working (1 if it is and 0 if it is not) and P describes whether there is a picture (1 if there is 
and 0 if there is not). The exogenous variable U_0 determines the status of the picture tube:
T = U_0. The exogenous variable U_1 is meant to represent the mysterious “other possible
causes”. If U_1 = 0, then whether there is a picture depends solely on the status of the picture
tube; that is, P = T. However, if U_1 = 1, then there is no picture (P = 0) no matter what
the status of the picture tube. Thus, in contexts where U_1 = 1, T = 0 is not a cause of P = 0.
Now suppose that K includes the contexts u⃗_01, where U_0 = 0 and U_1 = 1, and u⃗_11, where
U_0 = U_1 = 1. The only cause of P = 0 in both u⃗_01 and u⃗_11 is P = 0 itself (assuming that
we do not exclude self-causation). (Note that T = 0 is not a cause of P = 0 in u⃗_01, since
P = 0 even if T is set to 1.) Indeed, there is no nontrivial explanation of P = 0 relative to
any epistemic state K that includes u⃗_01 and u⃗_11. However, T = 0 is a cause of P = 0 in
all contexts in K satisfying T = 0 other than u⃗_01 and is a sufficient cause of P = 0 in all
contexts. If the probability of u⃗_01 is low (capturing the intuition that it is unlikely that more
than one thing goes wrong with a television at once), then we are entitled to view T = 0 as a
quite good partial explanation of P = 0 with respect to K.


Figure 7.1: The television with no picture.


Note that if we modify the causal model by adding an endogenous variable, say I,
corresponding to the “inexplicable” cause U_1 (with equation I = U_1), then T = 0 and I = 1 both
become explanations of P = 0, according to all variants of the HP definition. ■


Example 7.2.2 and the discussion after Example 7.2.1 emphasize the point I made above: 
if there is no name for a cause, then there may be no explanation. Adding an endogenous 
variable that provides such a name can result in there being an explanation when there was 
none before. This phenomenon of adding “names” to create explanations is quite common. 
For example, “gods” are introduced to explain otherwise inexplicable phenomena; clusters of 
symptoms are given names in medicine so as to serve as explanations. 

I now define partial explanation formally. The definition is now relative to a pair (K, Pr), 
just like the definition of blame. Once we have a probability distribution in the picture, it 
makes sense to modify conditions EX3 and EX4 so that they use the distribution. That is, 
EX3 would now say that Pr(K_{X⃗=x⃗∧φ}) ≠ 0, rather than just K_{X⃗=x⃗∧φ} ≠ ∅; similarly, EX4
becomes Pr([[X⃗ = x⃗]] | K_φ) ≠ 1. I assume that the modified definition is applied whenever
there is a probability in the picture.


Definition 7.2.3 Given a set K of contexts in a causal model M and a probability Pr on K,
let K(X⃗ = x⃗, φ, SC2) consist of all contexts u⃗ ∈ K_{X⃗=x⃗∧φ} that satisfy SC2 with respect to K
(i.e., the first condition in EX1). More precisely,


K(X⃗ = x⃗, φ, SC2) = {u⃗ ∈ K_{X⃗=x⃗∧φ} : there exists a conjunct X = x of X⃗ = x⃗
and a (possibly empty) conjunction Y⃗ = y⃗ such that
X = x ∧ Y⃗ = y⃗ is a cause of φ in (M, u⃗)}.


Let K(X⃗ = x⃗, φ, SC3) consist of all contexts u⃗ ∈ K that satisfy SC3; that is,


K(X⃗ = x⃗, φ, SC3) = {u⃗ ∈ K : (M, u⃗) ⊨ [X⃗ ← x⃗]φ}.


X⃗ = x⃗ is a partial explanation of φ with goodness (α, β) relative to (K, Pr) if X⃗ = x⃗
is an explanation of φ relative to K(X⃗ = x⃗, φ, SC3) − (K_{X⃗=x⃗∧φ} − K(X⃗ = x⃗, φ, SC2)),
α = Pr(K(X⃗ = x⃗, φ, SC2) | K_{X⃗=x⃗∧φ}), and β = Pr(K(X⃗ = x⃗, φ, SC3)). ■


The goodness of a partial explanation measures the extent to which it provides an explanation
of φ. As I mentioned above, there are two ways that X⃗ = x⃗ can fail to provide an explanation:
there may be some contexts u⃗ that satisfy X⃗ = x⃗ ∧ φ but X⃗ = x⃗ is not a cause of φ in
(M, u⃗) (these are the contexts in K_{X⃗=x⃗∧φ} − K(X⃗ = x⃗, φ, SC2)), and there may be contexts
where the sufficient causality condition [X⃗ ← x⃗]φ does not hold (these are the contexts in
K − K(X⃗ = x⃗, φ, SC3)). The parameter α in the definition of partial explanation measures
the fraction of the contexts in K_{X⃗=x⃗∧φ} that are “good” in the first sense; the parameter β
measures the fraction of contexts in K that are good in the second sense. For the first sense
of goodness, we care only about which contexts in K_{X⃗=x⃗∧φ} are good, so it makes sense to
condition on K_{X⃗=x⃗∧φ}. If X⃗ = x⃗ is an explanation of φ relative to K, then for all probability
distributions Pr on K, X⃗ = x⃗ is a partial explanation of φ relative to (K, Pr) with goodness
(1, 1).

In Example 7.2.1, if the agent believes that it is sunny in the Canary Islands with probability
.9, independent of whether Victoria actually goes to the Canaries or the tanning salon, then
Victoria going to the Canaries is a partial explanation of her being tanned with goodness
(α, β), where α ≥ .9 and β ≥ .9 (according to the updated and modified HP definitions;
recall that the original HP definition gives arguably inappropriate causality ascriptions in this
example). To see that α ≥ .9, note that the probability of Victoria going to the Canaries being
a cause of her being tanned is already .9 conditional on her going to the Canaries; it is at least
as high conditional on her going to the Canaries and being tanned. Similarly, β ≥ .9, since
the set of contexts that satisfy SC3 consists of all contexts where it is sunny in the Canary
Islands together with contexts where Victoria goes to the tanning salon and it is not sunny
in the Canaries. Similarly, in Example 7.2.2, if the agent believes that the probability of the
other mysterious causes being operative is .1 conditional on the picture tube being faulty (i.e.,
Pr(u⃗_01 | [[T = 0]]) = .1), then T = 0 is a partial explanation of P = 0 with goodness (.9, 1).
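
The goodness parameters in these two examples can be computed directly once a prior is fixed. The sketch below is an illustration under several assumptions that are not in the text: the equation for being tanned, (trip and sunny) or tanning salon; the priors of 1/2 on the trip and on the tanning salon; and the prior of 1/10 on the picture tube being faulty. Which contexts satisfy SC2 is taken from the discussion above rather than computed by a causality checker. All function and variable names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def goodness(prior, x_and_phi, sc2, sc3):
    """(alpha, beta) of Definition 7.2.3 for a prior given as {context: probability}."""
    k_x_phi = [u for u in prior if x_and_phi(u)]
    alpha = sum(prior[u] for u in k_x_phi if sc2(u)) / sum(prior[u] for u in k_x_phi)
    beta = sum(prior[u] for u in prior if sc3(u))
    return alpha, beta


def bern(p, v):
    """Probability that a binary variable with Pr(1) = p takes the value v."""
    return p if v == 1 else 1 - p


def tanned(c, s, ts):
    return (c == 1 and s == 1) or ts == 1  # assumed equation, not stated in the text


# Example 7.2.1. Context (c, s, ts): Canaries trip, sunny there, tanning salon.
half, sunny = Fraction(1, 2), Fraction(9, 10)  # only Pr(sunny) = 9/10 comes from the text
victoria = {(c, s, ts): bern(half, c) * bern(sunny, s) * bern(half, ts)
            for c, s, ts in product((0, 1), repeat=3)}
alpha_v, beta_v = goodness(
    victoria,
    x_and_phi=lambda u: u[0] == 1 and tanned(*u),
    sc2=lambda u: u[1] == 1,              # the trip is (part of) a cause when it is sunny
    sc3=lambda u: tanned(1, u[1], u[2]),  # [trip <- 1](tanned)
)

# Example 7.2.2. Context (u0, u1): T = u0, and P = T if u1 == 0, otherwise P = 0.
tenth = Fraction(1, 10)  # Pr(u1 = 1) = 1/10 as in the text; Pr(u0 = 0) = 1/10 is hypothetical
tv = {(u0, u1): bern(1 - tenth, u0) * bern(tenth, u1) for u0, u1 in product((0, 1), repeat=2)}
alpha_t, beta_t = goodness(
    tv,
    x_and_phi=lambda u: u[0] == 0,  # T = 0 already forces P = 0
    sc2=lambda u: u[1] == 0,        # T = 0 is a cause of P = 0 only if u1 = 0
    sc3=lambda u: True,             # [T <- 0](P = 0) holds in every context
)
print(alpha_v, beta_v, alpha_t, beta_t)
```

Exact fractions are used so that the bounds α ≥ .9 and β ≥ .9 for the Canaries example, and the goodness (.9, 1) for the television example, can be compared without rounding error.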

In the literature, there have been many attempts to define probabilistic notions of explana- 
tion, in the spirit of the probabilistic definition of causality discussed in Section 2.5. These 
definitions typically say that the explanation raises the probability of the explanandum. Just as in
the case of causality, rather than dealing with probabilistic explanations, we can use notions
like partial causality, in the spirit of the following example. 


Example 7.2.4 Suppose a Geiger counter is placed near a rock and the Geiger counter clicks. 
What’s an explanation? Assume that the agent does not know that the rock has uranium, and 
thus is somewhat radioactive, but considers it possible, and understands that being near a 
radioactive object causes a Geiger counter to click with some probability less than 1. Clearly 
being near the radioactive object raises the probability of the Geiger counter clicking, so it 
becomes an explanation according to the probabilistic accounts of explanation. If the Geiger 
counter necessarily clicks whenever it is placed near a radioactive object, then Definition 7.1.1 
would declare being near a radioactive object the explanation without difficulty. But since the 
clicking happens only with some probability $\beta < 1$, there must be some context that the
agent considers possible where the Geiger counter would not click, despite being placed near
the rock. Thus, the second half of EX1 does not hold, so being put close to the rock is not
an explanation of the Geiger counter clicking according to Definition 7.1.1. However, it is
a partial explanation, with goodness $(1, \beta)$: moving the Geiger counter close to the rock is
an explanation of it clicking relative to the set of contexts where it actually does click, and the
probability of such a context, by assumption, is $\beta$.


My sense is that most explanations are partial, although if $\alpha$ and $\beta$ are close to 1, we call
them “explanations” rather than “partial explanations”. The partiality may help explain why
explanations attenuate over longer chains: even though, for each link, the parameters $\alpha$ and
$\beta$ may be close to 1, over the whole chain, $\alpha$ and $\beta$ can get close to 0. For example, while
Sylvia’s depression might be viewed as a reasonably good explanation of her suicide, and
Sylvia’s genetic makeup might be viewed as a reasonably good explanation of her depression,
Sylvia’s genetic makeup might not be viewed as a particularly good explanation of her suicide.

Although the parameters $\alpha$ and $\beta$ in the notion of partial explanation do give a good sense
of the quality of an explanation, they do not tell the whole story, even if our notion of “goodness”
is concerned only with likelihood. For one thing, as I mentioned earlier, we are interested
in the probability of the explanation or, even better, the probability of the explanation
conditional on the explanandum (i.e., $\Pr([\![\vec{X} = \vec{x}]\!] \mid [\![\varphi]\!])$). But there may be a tension between
these measures. For example, suppose that, in the disjunctive forest-fire model, we add another
context $u_4$ where there is no lightning, the arsonist drops a match, but there is no forest
fire (suppose that, in $u_4$, someone is able to stamp out the forest fire before it gets going). For
definiteness, suppose that $\Pr(u_1) = .1$, $\Pr(u_2) = .8$, and $\Pr(u_4) = .1$, so the probability of
having only lightning is .1, the probability of the arsonist dropping the match and starting the 
forest fire is .8, and the probability of the arsonist dropping a match and there being no fire is 
.1. In this case, the lightning is an explanation of the fire, and thus is a partial explanation with 
goodness (1,1), but it has probability only 1/9 conditional on there being a fire. Although 
the arsonist is not an explanation of the fire, his dropping a match is a partial explanation of 
the fire with goodness (1, .9), and the arsonist has probability 8/9 conditional on there being 
a fire. 
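
The numbers in this example can be checked mechanically. The following is a minimal Python sketch, not part of the formal development. It assumes a structural equation in which a fire occurs if there is lightning, or if a match is dropped and nobody stamps the fire out; the variable name `stamped` and the helper names are hypothetical, introduced only for illustration. The second component of goodness is computed here as the probability of the contexts that satisfy SC3, as in the examples above.

```python
from fractions import Fraction

# Contexts of the extended disjunctive forest-fire model, with prior probabilities.
# "stamped" is a hypothetical exogenous variable: 1 means someone stamps out a
# fire started by the match (this is what distinguishes u4 from u2).
CONTEXTS = {
    "u1": ({"L": 1, "MD": 0, "stamped": 0}, Fraction(1, 10)),
    "u2": ({"L": 0, "MD": 1, "stamped": 0}, Fraction(8, 10)),
    "u4": ({"L": 0, "MD": 1, "stamped": 1}, Fraction(1, 10)),
}

def fire(values, **intervention):
    """Assumed equation for FF; keyword arguments override values (an intervention)."""
    v = {**values, **intervention}
    return int(v["L"] == 1 or (v["MD"] == 1 and v["stamped"] == 0))

def pr(pred):
    """Probability of the set of contexts satisfying pred."""
    return sum((p for vals, p in CONTEXTS.values() if pred(vals)), Fraction(0))

def pr_given_fire(var):
    """Probability that var = 1, conditional on there being a fire."""
    joint = pr(lambda v: v[var] == 1 and fire(v) == 1)
    return joint / pr(lambda v: fire(v) == 1)

def beta(var):
    """Probability of the contexts where setting var to 1 guarantees a fire (SC3)."""
    return pr(lambda v: fire(v, **{var: 1}) == 1)

print(pr_given_fire("L"), beta("L"))    # 1/9 1
print(pr_given_fire("MD"), beta("MD"))  # 8/9 9/10
```

The lightning has the better goodness but the lower probability conditional on the fire; for the arsonist it is the other way around, which is exactly the tension described above.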

As if that wasn’t enough, there is yet another measure of goodness that people are quite 
sensitive to. Suppose that we are scientists wondering why there is a fire in the chemistry lab. 
We suspect an arsonist. But suppose that we include in the model an endogenous variable O 
representing the presence of oxygen. Ignoring normality considerations, O = 1 is certainly a 
cause of the fire in all contexts, and it holds with probability 1. It is a trivial explanation (i.e., 
it does not satisfy EX4), so that perhaps might be a reason to exclude it. But now suppose 
that we add an extremely unlikely context where O = 0 and yet there is a fire (perhaps it is 
possible to evacuate the oxygen from the lab, yet still create a fire using some other oxidizing 
substance). In that case, O = 1 is still an explanation of the fire, and it has high probability 
conditional on there being a fire. Nevertheless, it is an explanation with, intuitively, very little 
explanatory power. We would typically prefer the explanation that an arsonist started the fire, 
even though this might not have such high probability. How can we make this precise? 

Here it is useful that we started with a set $\mathcal{K}$ of contexts that can be viewed as the contexts
that the agent considers possible before making the observation (in this case, before observing
the fire), and that we view Pr as intuitively representing the agent’s “pre-observation” prior
probability. Roughly speaking, I would like to define the explanatory power of a (partial)
explanation $\vec{X} = \vec{x}$ as $\Pr([\![\varphi]\!] \mid [\![\vec{X} = \vec{x}]\!])$. This describes how likely $\varphi$ becomes on learning
$\vec{X} = \vec{x}$. Taking $LF$ to stand for lab fire, if lab fires are not common, $\Pr([\![LF = 1]\!] \mid [\![O = 1]\!])$
is essentially equal to $\Pr([\![LF = 1]\!])$ and is not so high. Thus, oxygen has low explanatory
power for the fire in the lab. In contrast, $\Pr([\![LF = 1]\!] \mid [\![MD = 1]\!])$ is likely to be quite high;
learning that there is an arsonist who dropped a match makes the lab fire much more likely.

Although this definition captures some important intuitions, it is not quite what we want.
The problem is that it confounds correlation with causation. For example, according to this
definition, the barometer falling would have high explanatory power for rain (although it is
not a cause of rain). The following definition replaces the $[\![\varphi]\!]$ in $\Pr([\![\varphi]\!] \mid [\![\vec{X} = \vec{x}]\!])$ by
$\mathcal{K}(\vec{X} = \vec{x}, \varphi, \mathrm{SC2})$, the set of contexts where a conjunct of $\vec{X} = \vec{x}$ is part of a cause of $\varphi$.


Definition 7.2.5 The explanatory power of the partial explanation $\vec{X} = \vec{x}$ for $\varphi$ relative to
$(\mathcal{K}, \Pr)$ is $\Pr(\mathcal{K}(\vec{X} = \vec{x}, \varphi, \mathrm{SC2}) \mid [\![\vec{X} = \vec{x}]\!])$.


If $\vec{X} = \vec{x}$ is an explanation of $\varphi$ (rather than just being a partial explanation of $\varphi$), then
$\mathcal{K}(\vec{X} = \vec{x}, \varphi, \mathrm{SC2}) = [\![\vec{X} = \vec{x} \land \varphi]\!]$, and Definition 7.2.5 agrees with the original informal
definition. The difference between the two definitions arises if there are contexts where $\varphi$ and
$\vec{X} = \vec{x}$ both happen to be true, but $\vec{X} = \vec{x}$ is not a cause of $\varphi$. In Example 7.2.1, the context
where Victoria went to the Canary Islands, it was not sunny, but Victoria also went to the 
tanning salon, and so got tanned is one such example. Because of this difference, according 
to Definition 7.2.5, a falling barometer has 0 explanatory power for explaining the rain. Even 
though the barometer falls in almost all contexts where it rains (there may be some contexts 
where the barometer is defective, so it does not fall even if it rains), the barometer falling is 
not a cause of the rain in any context. Making the barometer rise would not result in the rain 
stopping! 
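
The barometer example can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the three contexts and their probabilities are hypothetical, and a simple but-for test stands in for the full check that the barometer reading is part of a cause. In this particular model the substitution is harmless, because the equation for rain does not mention the barometer, so no contingency could make the barometer matter.

```python
from fractions import Fraction

# Hypothetical prior over contexts (low_pressure, barometer_defective).
PRIOR = {
    (1, 0): Fraction(28, 100),
    (1, 1): Fraction(2, 100),
    (0, 0): Fraction(70, 100),
}

def solve(context, do=None):
    """Evaluate the structural equations; `do` maps variables to forced values."""
    do = do or {}
    low, defective = context
    return {
        "B": do.get("B", int(low == 1 and defective == 0)),  # barometer falls
        "R": do.get("R", low),                               # rain depends only on pressure
    }

def is_butfor_cause(context, var, effect):
    """var = 1 is a but-for cause of effect = 1: both hold, and forcing var to 0 undoes effect."""
    actual = solve(context)
    if actual[var] != 1 or actual[effect] != 1:
        return False
    return solve(context, do={var: 0})[effect] != 1

def pr(pred):
    return sum((p for c, p in PRIOR.items() if pred(c)), Fraction(0))

def naive_power(var, effect):
    """The informal definition: Pr(effect = 1 | var = 1)."""
    joint = pr(lambda c: solve(c)[var] == 1 and solve(c)[effect] == 1)
    return joint / pr(lambda c: solve(c)[var] == 1)

def causal_power(var, effect):
    """Definition 7.2.5, with the but-for test standing in for the set K(var = 1, effect = 1, SC2)."""
    return pr(lambda c: is_butfor_cause(c, var, effect)) / pr(lambda c: solve(c)[var] == 1)

print(naive_power("B", "R"))   # 1
print(causal_power("B", "R"))  # 0
```

The informal definition rates the falling barometer as a perfect explanation of rain, while the causal version gives it explanatory power 0.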

There is a tension between the explanatory power, the goodness, and the probability of
a partial explanation. As we have seen, $O = 1$ has low explanatory power for $LF = 1$,
although it has high probability conditional on the fire and has goodness (1, 1) (or close to
it, if there is a context where there is a fire despite the lack of oxygen). For another
example, in Example 7.1.4, as we observed, $ES_J = 1$ is not an explanation of the forest fire
in June ($FF_J = 1$) relative to $\mathcal{K}_0$. The problem is the context (0, 1, 1), which does not
satisfy $[ES_J \gets 1](FF_J = 1)$, so the second half of EX1 (i.e., SC3) does not hold. If
$\Pr((0, 1, 1)) = p$ and $\Pr([\![ES_J = 1]\!]) = q$, then the explanatory power of $ES_J = 1$ is $\frac{q - p}{q}$,
the probability of $ES_J = 1$ conditional on there being a fire in June is 1, and $ES_J = 1$
is a partial explanation of the fire in June with goodness $(1, 1 - p)$. By way of contrast,
$AS = 1 \land ES_J = 1$ has goodness (1, 1) and explanatory power 1, but may have much lower
probability than $ES_J = 1$ conditional on there being a fire in June.
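
The explanatory power of the June storm can be spelled out in one line. The derivation below is a sketch that assumes the context (0, 1, 1) is the only context satisfying $ES_J = 1$ in which $ES_J = 1$ is not part of a cause of $FF_J = 1$; with $p$ and $q$ as above, Definition 7.2.5 then gives

```latex
\begin{align*}
\Pr\bigl(\mathcal{K}(ES_J = 1, FF_J = 1, \mathrm{SC2}) \mid [\![ES_J = 1]\!]\bigr)
  &= \frac{\Pr([\![ES_J = 1]\!]) - \Pr((0,1,1))}{\Pr([\![ES_J = 1]\!])} \\
  &= \frac{q - p}{q}.
\end{align*}
```

So the explanatory power is close to 1 exactly when the problematic context is unlikely relative to the June storm itself.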

There is no obvious way to resolve the tension between these various measures of the 
goodness of an explanation. A modeler must just decide what is most important. Things get 
even worse if we allow disjunctive explanations. The probability of a disjunctive explanation 
will be higher than that of its disjuncts, so according to that metric, disjunctive explanations 
are better. What about the goodness of a disjunctive partial explanation? It is not quite clear 
how to define this. Perhaps the most natural way of defining goodness would be to take its 
goodness to be the maximum goodness of each of its disjuncts, but if there are two disjuncts, 
one of which has goodness (.8,.9), while the other has goodness (.9,.8), which one has 
maximum goodness? 

There is an intuition that a disjunctive explanation has less explanatory power than its 
disjuncts. This intuition can be captured in a natural way by taking the explanatory power of
the disjunction $\vec{X}_1 = \vec{x}_1 \lor \ldots \lor \vec{X}_k = \vec{x}_k$ for $\varphi$ relative to $(\mathcal{K}, \Pr)$ to be

$$\max_{i = 1, \ldots, k} \Pr(\mathcal{K}(\vec{X}_i = \vec{x}_i, \varphi, \mathrm{SC2}) \mid [\![\vec{X}_1 = \vec{x}_1 \lor \ldots \lor \vec{X}_k = \vec{x}_k]\!]). \qquad (7.1)$$

Thus, rather than taking the explanatory power of the disjunction $\vec{X}_1 = \vec{x}_1 \lor \ldots \lor \vec{X}_k = \vec{x}_k$
to be $\Pr(\mathcal{K}(\vec{X}_1 = \vec{x}_1 \lor \ldots \lor \vec{X}_k = \vec{x}_k, \varphi, \mathrm{SC2}) \mid [\![\vec{X}_1 = \vec{x}_1 \lor \ldots \lor \vec{X}_k = \vec{x}_k]\!])$, which would
be the obvious generalization of Definition 7.2.5, what (7.1) says is that, for each disjunct
$\vec{X}_i = \vec{x}_i$, we consider how likely it is that $\vec{X}_i = \vec{x}_i$ is a cause of $\varphi$ conditional on learning
the disjunction, and identify the explanatory power of the disjunction with the maximum
likelihood, taken over all disjuncts.
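
To see (7.1) at work, here is a short Python sketch that reuses the three-context forest-fire model from earlier in this section. The sets of contexts where each disjunct holds, and where it is a cause of the fire, are supplied by hand rather than derived from a causal model; that is an assumption of the sketch, as are the function names.

```python
from fractions import Fraction

def pr(prior, contexts):
    """Probability of a set of contexts."""
    return sum((prior[c] for c in contexts), Fraction(0))

def power(prior, holds, cause_set):
    """Definition 7.2.5: Pr(cause_set | holds)."""
    return pr(prior, cause_set & holds) / pr(prior, holds)

def disjunctive_power(prior, holds_sets, cause_sets):
    """Formula (7.1): condition on the whole disjunction, then maximize over disjuncts."""
    disjunction = set().union(*holds_sets)
    return max(power(prior, disjunction, k) for k in cause_sets)

prior = {"u1": Fraction(1, 10), "u2": Fraction(8, 10), "u4": Fraction(1, 10)}
holds_L, cause_L = {"u1"}, {"u1"}           # lightning holds in, and causes the fire in, u1
holds_MD, cause_MD = {"u2", "u4"}, {"u2"}   # match dropped in u2 and u4, causes the fire in u2

print(power(prior, holds_L, cause_L))      # 1
print(power(prior, holds_MD, cause_MD))    # 8/9
print(disjunctive_power(prior, [holds_L, holds_MD], [cause_L, cause_MD]))  # 4/5
```

With these numbers the disjunction “lightning or a dropped match” has explanatory power 4/5, lower than that of either disjunct on its own, in line with the intuition that motivates (7.1).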

The notion of explanatory power helps explain why we are less interested in trivial expla- 
nations, even though they have high probability conditional on the explanandum (probability 
1, in fact). They typically have worse explanatory power than nontrivial explanations. 

These notions of goodness could be refined further by bringing normality into the picture. 
However, I have not yet explored how to best do this. 


7.3 The General Definition of Explanation


In general, an agent may be uncertain about the causal model, so an explanation will have 
to include information about it. It is relatively straightforward to extend Definition 7.1.1 to 
accommodate this provision. Now an epistemic state $\mathcal{K}$ consists not only of contexts, but
of causal settings, that is, pairs $(M, \vec{u})$ consisting of a causal model $M$ and a context $\vec{u}$.
Intuitively, an explanation should now consist of some causal information (such as “prayers
do not cause fires”) and some facts that are true.

I assume that the causal information in an explanation is described by a formula $\psi$ in the
causal language. For example, $\psi$ could say things like “prayers do not cause fires”, which
corresponds to the formula $(F = 0) \Rightarrow [P \gets 1](F = 0)$, where $P$ is a variable describing
whether prayer takes place and $F = 0$ says that there is no fire; that is, if there is no fire, then
praying won’t result in a fire. I take a (general) explanation to have the form $(\psi, \vec{X} = \vec{x})$,
where $\psi$ is an arbitrary formula in the causal language and, as before, $\vec{X} = \vec{x}$ is a conjunction
of primitive events. The first component in a general explanation restricts the set of causal
models. Recall from Section 5.4 that $\psi$ is valid in causal model $M$, written $M \models \psi$, if
$(M, \vec{u}) \models \psi$ for all contexts $\vec{u}$.

I now extend Definition 7.1.1 to the more general setting where $\mathcal{K}$ consists of arbitrary
causal settings, dropping the assumption that the causal model is known. Conditions EX3
and EX4 remain unchanged; the differences arise only in EX1 and EX2. Let
$\mathcal{M}(\mathcal{K}) = \{M : (M, \vec{u}) \in \mathcal{K} \text{ for some } \vec{u}\}$.


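Definition 7.3.1 $(\psi, \vec{X} = \vec{x})$ is an explanation of $\varphi$ relative to a set $\mathcal{K}$ of causal settings if
the following conditions hold:


EX1. $\vec{X} = \vec{x}$ is a sufficient cause of $\varphi$ in all causal settings $(M, \vec{u}) \in \mathcal{K}$ satisfying $\vec{X} = \vec{x} \land \varphi$
such that $M \models \psi$. More precisely,

• If $(M, \vec{u}) \in \mathcal{K}$, $(M, \vec{u}) \models \vec{X} = \vec{x} \land \varphi$, and $M \models \psi$, then there exists a
conjunct $X = x$ of $\vec{X} = \vec{x}$ and a (possibly empty) conjunction $\vec{Y} = \vec{y}$ such that
$X = x \land \vec{Y} = \vec{y}$ is a cause of $\varphi$ in $(M, \vec{u})$.

• $(M, \vec{u}') \models [\vec{X} \gets \vec{x}]\varphi$ for all causal settings $(M, \vec{u}') \in \mathcal{K}$ such that $M \models \psi$.


EX2. $(\psi, \vec{X} = \vec{x})$ is minimal; there is no pair $(\psi', \vec{X}' = \vec{x}')$ satisfying EX1 such that
either $\{M'' \in \mathcal{M}(\mathcal{K}) : M'' \models \psi'\} \supset \{M'' \in \mathcal{M}(\mathcal{K}) : M'' \models \psi\}$ (where
“$\supset$” denotes strict superset), $\vec{X}' \subseteq \vec{X}$, and $\vec{x}'$ is the restriction of $\vec{x}$ to $\vec{X}'$, or
$\{M'' \in \mathcal{M}(\mathcal{K}) : M'' \models \psi'\} = \{M'' \in \mathcal{M}(\mathcal{K}) : M'' \models \psi\}$, $\vec{X}' \subset \vec{X}$,
and $\vec{x}'$ is the restriction of $\vec{x}$ to $\vec{X}'$. Roughly speaking, this says that no subset of $\vec{X}$
provides a sufficient cause of $\varphi$ in more contexts than $\vec{X}$ does, and no strict subset of
$\vec{X}$ provides a sufficient cause of $\varphi$ in the same set of contexts as $\vec{X}$ does.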


EX3. $\mathcal{K}_{\vec{X} = \vec{x} \land \varphi} \neq \emptyset$.


The explanation is nontrivial if it satisfies in addition 


EX4. $(M, \vec{u}) \models \neg(\vec{X} = \vec{x})$ for some $(M, \vec{u}) \in \mathcal{K}$ or $M \not\models \psi$ for some $M \in \mathcal{M}(\mathcal{K})$.


Note that in EX1, the sufficient causality requirement is restricted to causal settings $(M, \vec{u}) \in \mathcal{K}$
such that both $M \models \psi$ and $(M, \vec{u}) \models \vec{X} = \vec{x}$. Although both components of an explanation
are formulas in the causal language, they play different roles. The first component serves to
restrict the set of causal models considered to those with the appropriate structure; the second
describes a cause of $\varphi$ in the resulting set of situations.

Clearly Definition 7.1.1 is the special case of Definition 7.3.1 where there is no uncertainty
about the causal structure (i.e., there is some $M$ such that if $(M', \vec{u}) \in \mathcal{K}$, then $M = M'$). In
this case, it is clear that we can take $\psi$ in the explanation to be true.


Example 7.3.2 Paresis develops only in patients who have been syphilitic for a long time, 
but only a small number of patients who are syphilitic in fact develop paresis. For simplicity, 
suppose that no other factor is known to be relevant in the development of paresis. This 
description is captured by a simple causal model $M_P$. There are two endogenous variables,
$S$ (for syphilis) and $P$ (for paresis), and two exogenous variables, $U_1$, the background factors
that determine $S$, and $U_2$, which intuitively represents “disposition to paresis”, that is, the
factors that determine, in conjunction with syphilis, whether paresis actually develops. An
agent who knows this causal model and that a patient has paresis does not need an explanation
of why: he knows without being told that the patient must have syphilis and that $U_2 = 1$. In
contrast, for an agent who does not know the causal model (i.e., considers a number of causal
models of paresis possible), $(\psi_P, S = 1)$ is an explanation of paresis, where $\psi_P$ is a formula
that characterizes $M_P$.


Notes


Although issues concerning scientific explanation have been discussed by philosophers 
for millennia, the mainstays of recent discussion have been Hempel’s [1965] deductive- 
nomological model and Salmon’s [1970] statistical relevance model. Woodward [2014] pro- 
vides a good recent overview of work in philosophy on explanation. Although many have 
noted the connection between explanation and causation, there are not too many formal defi- 
nitions of explanation in terms of causality in the literature. In particular, the definitions given 
by Hempel and Salmon do not really involve causality. In later work, Salmon [1984] did try 
to bring causality into his definitions, using the so-called process theories of causation [Dowe 
2000], although he tried to do so without using counterfactuals. Gärdenfors [1980, 1988] and
Van Fraassen [1980], among others, give definitions of probabilistic explanation that involve
the explanation raising the probability of the explanandum. As I said in the main text, Salmon
[1984] introduced the notions of ontic and epistemic explanation. 

Lewis [1986a] does relate causality and explanation; he defends the thesis that “to explain 
an event is to provide some information about its causal history”. Although this view is 
compatible with the definition given here, there is no formal definition given to allow for a 
careful comparison between the approaches. Woodward [2003] clearly views explanation as 
causal explanation and uses a theory of causality based on structural equations. But he does 
not provide different definitions of causality and explanation, and he does not have a notion 
of explanation that depends on epistemic state. As a consequence, he has no analogues to the 
various notions of goodness of explanation considered here. 

The definition of explanation given here, as well as much of the material in this chapter, 
is based on material in [Halpern and Pearl 2005b]. However, there are some significant dif- 
ferences between the definition given here and that in [Halpern and Pearl 2005b]; perhaps 
the most significant is the requirement of sufficient causality here (essentially, SC3). As I 
argued in Section 7.1, requiring sufficient causality seems to give a notion of explanation that 
corresponds more closely to natural language usage. 

The idea that an explanation should be relative to an agent’s epistemic state is discussed
at length by Gärdenfors [1988]. The idea appears in Gärdenfors’ earlier work [Gärdenfors
1980] as well; it is also explicit in Salmon’s [1984] notion of epistemic explanation. The
observation that explanations may include (fragments of) a causal model is due to Gärdenfors
[1980, 1988], as is the example of Mr. Johanssen. Hempel [1965] also observed that an
explanation will have to be relative to uncertainty involving not just the context but also the
causal model. However, neither Gärdenfors nor Hempel explicitly incorporated causality into
their definitions, focusing instead on statistical and nomological information (i.e., information 
about basic physical laws). 

Example 7.1.4 is due to Bennett (see [Sosa and Tooley 1993, pp. 222-223]). The analysis
follows along the lines of the analysis in [Halpern and Pearl 2005b]. Example 7.2.1 is due to
Gärdenfors [1988]. He pointed out that we normally accept “Victoria took a vacation in the
Canary Islands” as a satisfactory explanation of Victoria being tanned; indeed, according to 
his definition, it is an explanation. 

The notion of partial explanation defined here is related to, but different from, that of
Chajewska and Halpern [1997]. Gärdenfors identifies the explanatory power of the (partial)
explanation $\vec{X} = \vec{x}$ of $\varphi$ with $\Pr(\varphi \mid \vec{X} = \vec{x})$ (see [Chajewska and Halpern 1997;
Gärdenfors 1988]). More precisely, Gärdenfors takes the explanatory power of $\vec{X} = \vec{x}$
to be $\Pr([\![\varphi]\!] \mid [\![\vec{X} = \vec{x}]\!]) - \Pr([\![\varphi]\!])$. When comparing the explanatory power of two
explanations for $\varphi$, it suffices to consider just $\Pr([\![\varphi]\!] \mid [\![\vec{X} = \vec{x}]\!])$, since the $\Pr([\![\varphi]\!])$
term appears in both expressions. Chajewska and Halpern [1997] argued that the quotient
$\Pr([\![\varphi]\!] \mid [\![\vec{X} = \vec{x}]\!]) / \Pr([\![\varphi]\!])$ gives a better measure of explanatory power than the difference,
but the issues raised by Chajewska and Halpern are somewhat orthogonal to the concerns of
this chapter.

The dominant approach to explanation in the Artificial Intelligence literature is the max- 
imum a posteriori (MAP) approach; see, for example, [Henrion and Druzdzel 1990; Pearl 
1988; Shimony 1991] for a discussion. The MAP approach focuses on the probability of the
explanation, that is, $\Pr([\![\vec{X} = \vec{x}]\!] \mid [\![\varphi]\!])$. It is based on the intuition that the best explanation
for an observation is the state of the world (in the language of this book, the context) that is
most probable given the evidence. Although this intuition is quite reasonable, it completely
ignores the issue of explanatory power, an issue that people are quite sensitive to. The formula
$O = 1$ has high probability in the lab fire example, even though most people would not consider
the presence of oxygen a satisfactory explanation of the fire. To remedy this problem,
more intricate combinations of the quantities $\Pr([\![\vec{X} = \vec{x}]\!] \mid [\![\varphi]\!])$, $\Pr([\![\varphi]\!] \mid [\![\vec{X} = \vec{x}]\!])$, and
$\Pr([\![\varphi]\!])$ have been suggested to quantify the causal relevance of $\vec{X} = \vec{x}$ to $\varphi$, but, as Pearl
[2000, p. 221] argues, without taking causality into account, it seems that we cannot capture
explanation in a reasonable way.

Fitelson and Hitchcock [2011] discuss a number of probabilistic measures of causal 
strength and the connections between them. Almost all of these can be translated to prob- 
abilistic measures of “goodness” of an explanation as well. The notions of partial explanation 
and explanatory power that I have discussed here certainly do not exhaust the possibilities. 
Schupbach [2011] also considers various approaches to characterizing explanatory power in 
probabilistic terms. 

Example 7.3.2 is due to Scriven [1959]. Apparently there are now other known factors, but
this does not change the import of the example.


Applying the Definitions


Throughout history, people have studied pure science from a desire to understand 
the universe rather than practical applications for commercial gain. But their 
discoveries later turned out to have great practical benefits. 


Stephen Hawking 


In Chapter 1, I said that my goal was to get definitions of notions like causality, responsibility, 
and blame that matched our natural language usage of these words and were also useful. Now 
that we are rapidly approaching the end of the book, it seems reasonable to ask where we 
stand in this project. 

First some good news: I hope that I have convinced most of you that the basic approach 
of using structural equations and defining causality in terms of counterfactuals can deal with 
many examples, especially once we bring normality into the picture. Moreover, the frame- 
work also allows us to define natural notions of responsibility, blame, and explanation. These 
notions do seem to capture many of the intuitions that people have in a natural way. 

Next, some not-so-good news: One thing that has become clear to me in the course of 
writing this book is that things haven’t stabilized as much as I had hoped they would. Just 
in the process of writing the book, I developed a new definition of causality (the modified 
HP definition) and a new approach to dealing with normality (discussed in Section 3.5), and 
modified the definitions of responsibility, blame, and explanation (see the notes at the end of 
Chapters 6 and 7). Although I think that the latest definitions hold up quite well, particularly 
the modified HP definition combined with the alternative notion of normality, given that the 
modified HP definition is my third attempt at defining causality, and that other researchers 
continue to introduce new definitions, I certainly cannot be too confident that this is really the 
last word. Indeed, as I have indicated at various points in the book, there are some subtleties 
that the current definition does not seem to be capturing quite so well. To my mind, what 
is most needed is a good definition of agent-relative normality, which takes into account the 
issues discussed in Example 6.3.4 (where A and B can flip switches to determine whether C
is shocked) and meshes well with the definition of causality, but doubtless others will have 
different concerns. 


Moreover, when we try to verify experimentally the extent to which the definitions that we
give actually measure how people ascribe causality and responsibility, the data become messy. 
Although the considerations discussed in Chapter 6 (pivotality, normality, and blame assign- 
ments) seem to do quite a good job of predicting how people will ascribe causal responsibility 
at a qualitative level, because all these factors (and perhaps others) affect people’s causality 
and responsibility judgments, it seems that it will be hard to design a clean theory that completely
characterizes exactly how people ascribe causality, responsibility, and blame at a more
quantitative level. 

So where does this leave us? I do not believe that there is one “true” definition of causality. 
We use the word in many different, but related, ways. It is unreasonable to expect one defini- 
tion to capture them all. Moreover, there are a number of closely related notions—causality, 
blame, responsibility, intention—that clearly are often confounded. Although we can try to 
disentangle them at a theoretical level, people clearly do not always do so. 

That said, I believe that it is important and useful to have precise formal definitions. To 
take an obvious example, legal judgments depend on causal judgments. A jury will award 
a large settlement to a patient if it believes that the patient’s doctor was responsible for an 
inappropriate outcome. Although we might disagree on whether and the extent to which 
the doctor was responsible, the disagreement should be due to the relevant facts in the case, 
not a disagreement about what causality and responsibility mean. Even if people confound 
notions like causality and responsibility, it is useful to get definitions that distinguish them 
(and perhaps other notions as well), so we can be clear about what we are discussing. 

Although the definition(s) may not be able to handle all the subtleties, that is not necessary 
for them to be useful. I have discussed a number of different approaches here, and the jury 
is still out on which is best. The fact that the approaches all give the same answers in quite a 
few cases makes me even more optimistic about the general project. 

I conclude the book with a brief discussion of three applications of causality related to 
computer science. These have the advantage that we have a relatively clear idea of what 
counts as a “good” definition. The modeling problem is also somewhat simpler; it is relatively 
clear exactly what the exogenous and endogenous variables should be, and what their ranges 
are, so some of the modeling issues encountered in Chapter 4 no longer arise. 


8.1 Accountability 


The dominant approach to dealing with information security has been preventative: we try to 
prevent someone from accessing confidential data or connecting to a private network. More 
recently, there has been a focus on accountability: when a problem occurs, it should be pos- 
sible after the fact to determine the cause of the problem (and then punish the perpetrators 
appropriately). Of course, dealing with accountability requires that we have good notions of 
causality and responsibility. 

The typical assumption in the accountability setting, at least for computer science applica- 
tions, is that we have logs that completely describe exactly what happened; there is no uncer- 
tainty about what happened. Moreover, in the background, there is a programming language. 
An adversary writes a program in some pre-specified programming language that interacts 
with the system. We typically understand the effect of changing various lines of code. The
variables in the program can be taken to be the endogenous variables in the causal model.
The programs used and the initial values of program variables are determined by the context. 
Changing some lines of code becomes an intervention. The semantics of the program in the 
language determines the causal model and the effect of interventions. 
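
As a rough illustration of this reading, the following Python sketch treats a tiny program as a causal model: the initial values play the role of the context, running the program produces the log, and replacing one line is an intervention. The program, the variable names, and the helper functions are all hypothetical, and the check performed is only a simple but-for test, not the full HP definition.

```python
def run(program, initial):
    """Execute the assignments in order and return the final state (the 'log')."""
    state = dict(initial)
    for target, expr in program:
        state[target] = expr(state)
    return state

def intervene(program, index, new_expr):
    """Return a copy of the program with the right-hand side of one line replaced."""
    patched = list(program)
    target, _ = patched[index]
    patched[index] = (target, new_expr)
    return patched

# A hypothetical adversarial program: each line assigns one program variable.
program = [
    ("authorized", lambda s: s["password_ok"]),
    ("bypass", lambda s: True),                           # the suspicious line
    ("leak", lambda s: s["authorized"] or s["bypass"]),
]
initial = {"password_ok": False}                          # the context

actual = run(program, initial)
counterfactual = run(intervene(program, 1, lambda s: False), initial)

print(actual["leak"], counterfactual["leak"])  # True False
```

Since the leak occurs in the actual run but not once line 1 is changed, that line is a but-for cause of the leak in this context.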

The HP definitions of causality and responsibility fit this application quite well. Of course, 
if some logs are missing, we can then move to a setting where there is an epistemic state, with 
a probability on contexts; the contexts of interest are the ones that would generate the portion 
of the log that is seen. If we are interested in a setting where there are several agents working 
together to produce some outcome, blame becomes even more relevant. Agents may have 
(possibly incorrect) beliefs about the programs that other agents are using. Issues of which 
epistemic state to use (the one that the agent actually had or the one that the agent should have 
had) become relevant. But the general framework of structural equations and the definitions 
provided here should work well. 


8.2 Causality in Databases 


A database is viewed as consisting of a large collection of tuples. A typical tuple may, for 
example, give an employee’s name, manager, gender, address, social security number, salary, 
and job title. A standard query in a database may have the form: tell me all employees who 
are programmer analysts, earn a salary of less than $100,000, and work for a manager earning 
a salary greater than $150,000. Causality in databases aims to answer the question: Given 
a query and a particular output of the query, which tuples in the database caused that out- 
put? This is of interest because we may be, for example, interested in explaining unexpected 
answers to a query. Tim Burton is a director well known for fantasy movies involving dark
Gothic themes, such as “Edward Scissorhands” and “Beetlejuice”. A user wishing to learn
more about his work might query the IMDB movie database (www.imdb.com) about the gen- 
res of movies that someone named Burton directed. He might be surprised (and perhaps a little 
suspicious of the answer) when he learns that one genre is “Music and Musical”. He would 
then want to know which tuple (i.e., essentially, which movie) caused that output. He could 
then check whether there is an error in the database, or perhaps other directors named Burton 
who directed musicals. It turns out that Tim Burton did direct a musical (“Sweeney Todd”); 
there are also two other directors named Burton (David Burton and Humphrey Burton) who 
directed other musicals. 

Another application of the causality definition is for diagnosing network failures. Consider
a computer network consisting of a number of servers, with physical links between them. Suppose that
the database contains tuples of the form $(Link(x, y), Active(x), Active(y))$. Given a server
$x$, $Active(x)$ is either 0 or 1, depending on whether $x$ is active; similarly $Link(x, y)$ is either
0 or 1 depending on whether the link between $x$ and $y$ is up. Servers $x$ and $y$ are connected
if there is a path of links between them, all of which are up, such that all servers on the path
are active. Suppose that at some point the network administrator observes that servers $x_0$
and $y_0$ are no longer connected. This means that the query $Connected(x_0, y_0)$ will return
“false”. The causality query “Why did $Connected(x_0, y_0)$ return ‘false’?” will return all
the tuples $(Link(x, y), Active(x), Active(y))$ that are on some path from $x_0$ to $y_0$ such that
either server $x$ is not active, server $y$ is not active, or the link between them is down. These
are the potential causes of $Connected(x_0, y_0)$ returning “false”.
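
The causality query just described can be sketched in a few lines of Python. This is an illustrative brute-force version, not an efficient algorithm: it enumerates every simple path over the physical links and collects the tuples with a fault. The four-server network and all function names are hypothetical.

```python
def link_key(links, a, b):
    """Return the key under which the undirected link between a and b is stored."""
    return (a, b) if (a, b) in links else (b, a)

def simple_paths(links, src, dst):
    """Yield every simple path from src to dst over physical links, whether up or down."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    stack = [[src]]
    while stack:
        path = stack.pop()
        if path[-1] == dst:
            yield path
            continue
        for nxt in neighbours.get(path[-1], ()):
            if nxt not in path:
                stack.append(path + [nxt])

def rows_on(links, active, path):
    """Yield ((x, y), (Link(x, y), Active(x), Active(y))) for each link on the path."""
    for a, b in zip(path, path[1:]):
        x, y = link_key(links, a, b)
        yield (x, y), (links[(x, y)], active[x], active[y])

def connected(links, active, src, dst):
    """True if some path has every link up and every server on it active."""
    return any(all(0 not in row for _, row in rows_on(links, active, p))
               for p in simple_paths(links, src, dst))

def potential_causes(links, active, src, dst):
    """Tuples on some path from src to dst with a down link or an inactive endpoint."""
    return {(key, row)
            for p in simple_paths(links, src, dst)
            for key, row in rows_on(links, active, p)
            if 0 in row}

# Hypothetical network: the link b-d is down and server c is inactive.
links = {("a", "b"): 1, ("b", "d"): 0, ("a", "c"): 1, ("c", "d"): 1}
active = {"a": 1, "b": 1, "c": 0, "d": 1}

print(connected(links, active, "a", "d"))  # False
print(sorted(potential_causes(links, active, "a", "d")))
```

In this network the query reports the down link between b and d, together with both links that touch the inactive server c.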

I now briefly sketch how this is captured formally. As I said, a database $D$ is taken to be
a collection of tuples. For each tuple $t$, there is a binary variable $X_t$, which is 0 if the tuple
$t$ is not in the database of interest and 1 if it is. We are interested in causal models where
the context determines the database. That is, the context determines which variables $X_t$ are 1
and which are 0; there are no interesting structural equations. There is assumed to be a query
language $\mathcal{L}^Q$ that is richer than the language $\mathcal{L}(\mathcal{S})$ introduced in Section 2.2.1. Each formula
$\varphi \in \mathcal{L}^Q$ has a free variable $x$ that ranges over tuples. There is a relation $\models$ such that, for each
tuple $t$ and formula $\varphi \in \mathcal{L}^Q$, either $D \models \varphi(t)$ or $D \models \neg\varphi(t)$ (but not both). If $D \models \varphi(t)$,
then $t$ is said to satisfy the query $\varphi$ in database $D$. In response to a query $\varphi$, a database $D$
returns all the tuples $t$ such that $D \models \varphi(t)$.

We can easily build a causal language based on $\mathcal{L}^Q$ in the spirit of $\mathcal{L}(\mathcal{S})$. That is, instead of
starting with Boolean combinations of formulas of the form $X = x$, we start with formulas of
the form $\varphi(t)$, where $\varphi \in \mathcal{L}^Q$ and $t$ is a tuple, in addition to formulas of the form $X_t = i$ for
$i \in \{0, 1\}$. If $M$ is a causal model for databases, since each causal setting $(M, \vec{u})$ determines
a database, $(M, \vec{u}) \models \varphi(t)$ exactly if $D \models \varphi(t)$, where $D$ is the database determined by
$(M, \vec{u})$, and $(M, \vec{u}) \models X_t = 1$ exactly if $t \in D$. We can then define $[\vec{X} \gets \vec{x}]\varphi$ just as in
Section 2.2.1.

Of course, for causality queries to be truly useful, it must be possible to compute them 
efficiently. To do this, a slight variant of the updated HP definition has been considered. Say 
that tuple t is a cause of the answer t′ for $\varphi$ if


D1. $(M, \vec{u}) \models (X_t = 1) \land \varphi(t')$ (so $t \in D$ and $D \models \varphi(t')$, where D is the database
determined by $(M, \vec{u})$); and


D2. there exists a (possibly empty) set T of tuples contained in D such that $t \notin T$,
$D - T \models \varphi(t')$, and $D - (T \cup \{t\}) \models \neg\varphi(t')$.


If the tuple t is a cause of the answer t′ for $\varphi$ in $(M, \vec{u})$, then $X_t = 1$ is a cause of $\varphi(t')$
in $(M, \vec{u})$. D1 is just AC1. Translating to the kinds of formulas that we have been using all
along, and taking $\vec{X}_T$ to consist of all the variables $X_{t''}$ such that $t'' \in T$, the second clause
says: $(M, \vec{u}) \models [\vec{X}_T \leftarrow \vec{0}]\varphi(t')$ and $(M, \vec{u}) \models [\vec{X}_T \leftarrow \vec{0}, X_t \leftarrow 0]\neg\varphi(t')$. Thus, it is saying
that $(\vec{X}_T, \vec{0}, 0)$ is a witness to the fact that $X_t = 1$ is a cause of $\varphi(t')$ according to the original
HP definition; AC2(b$^o$) holds. Note that AC3 is satisfied vacuously.

This does not necessarily make $X_t = 1$ a cause of $\varphi(t')$ according to the updated HP
definition. There may be a subset T′ of T such that $(M, \vec{u}) \models [\vec{X}_{T'} \leftarrow \vec{0}]\neg\varphi(t')$, in which case
AC2(b$^u$) would fail. Things are better if we restrict to monotone queries, that is, queries $\varphi$
such that for all t′, if $D \models \varphi(t')$ and $D \subseteq D'$, then $D' \models \varphi(t')$ (in words, if $\varphi(t')$ holds for
a small database, then it is guaranteed to hold for larger databases, with more tuples). Monotone
queries arise often in practice, so proving results for monotone queries is of practical
interest. If $\varphi$ is a monotone query, then AC2(b$^u$) holds as well: if $(M, \vec{u}) \models [\vec{X}_T \leftarrow \vec{0}]\varphi(t')$,
then for all subsets T′ of T, we must have $(M, \vec{u}) \models [\vec{X}_{T'} \leftarrow \vec{0}]\varphi(t')$ (because $(M, \vec{u}) \models
[\vec{X}_{T'} \leftarrow \vec{0}]\varphi(t')$ iff $D - T' \models \varphi(t')$, and if $T' \subseteq T$, then $D - T \subseteq D - T'$, so we can
appeal to monotonicity). It is also easy to see that for a monotone query, this definition
guarantees that $X_t = 1$ is part of a cause of $\varphi(t')$ according to the modified HP definition.

The second clause has some of the flavor of responsibility; we are finding a set such that
removing this set from D results in t being critical for $\varphi(t')$. Indeed, we can define a notion
of responsibility in databases by saying that tuple t has degree of responsibility 1/(k + 1) for
the answer t′ for $\varphi$ if k is the size of the smallest set T for which the definition above holds.
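To make clauses D1 and D2 and the resulting degree of responsibility concrete, here is a minimal brute-force sketch in Python. Everything in it is an illustrative assumption rather than part of the work discussed here: tuples are modeled as `(source, target)` link pairs, the monotone query is plain reachability, the helper names (`connected`, `smallest_contingency`, `responsibility`) are hypothetical, and a tuple that is not a cause is given responsibility 0 by convention. The search is exponential in the size of the database, so it says nothing about the efficient algorithms mentioned below.

```python
from itertools import combinations

def connected(db, src, dst):
    """Monotone query: is dst reachable from src using the link tuples in db?"""
    frontier, seen = [src], {src}
    while frontier:
        node = frontier.pop()
        for a, b in db:
            if a == node and b not in seen:
                seen.add(b)
                frontier.append(b)
    return dst in seen

def smallest_contingency(db, t, query):
    """Return a smallest set T witnessing D1 and D2 for tuple t, or None."""
    db = frozenset(db)
    if t not in db or not query(db):            # D1: t is in D and D satisfies the query
        return None
    others = sorted(db - {t})
    for k in range(len(others) + 1):            # try the smallest sets T first
        for T in combinations(others, k):
            rest = db - set(T)
            # D2: the query holds in D - T but fails in D - (T u {t})
            if query(rest) and not query(rest - {t}):
                return set(T)
    return None

def responsibility(db, t, query):
    """Degree of responsibility 1/(k+1), with k the size of a smallest T (0 if no T exists)."""
    T = smallest_contingency(db, t, query)
    return 0.0 if T is None else 1.0 / (len(T) + 1)

# Hypothetical database of links and the query "is d reachable from a?"
links = {("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")}
query = lambda db: connected(db, "a", "d")

for link in sorted(links):
    print(link, responsibility(links, link, query))
```

In this toy database the link `("c", "d")` is critical on its own (the empty set T works), whereas `("a", "c")` becomes critical only after `("a", "b")` is removed, so its degree of responsibility is 1/2.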

Notice that I said “if t is a cause of the answer t′ for $\varphi$ in $(M, \vec{u})$, then $X_t = 1$ is a
cause of $\varphi(t')$ in $(M, \vec{u})$”; I did not say “if and only if”. The converse is not necessarily
true. The only witnesses $(\vec{W}, \vec{w}, 0)$ that this definition allows are ones where $\vec{w} = \vec{0}$. That
is, in determining whether the counterfactual clause holds, we can consider only witnesses
that involve removing tuples from the database, not witnesses that might add tuples to the
database. This can be captured in terms of a normality ordering that makes it abnormal to add
tuples to the database but not to remove them. However, the motivation for this restriction
is not normality as discussed in Chapter 3, but computational. For causality queries to be of
interest, it must be possible to compute the answers quickly, even in huge databases, with
millions of tuples. This restriction makes the computation problem much simpler. Indeed,
it is known that for conjunctive queries (i.e., queries that can be written as conjunctions of
basic queries about components of tuples in databases; I omit the formal definitions here),
computing whether t is an actual cause of $\varphi$ can be done in time polynomial in the size
of the database, which makes it quite feasible. Importantly, conjunctive queries (which are
guaranteed to be monotone) arise frequently in practice.


8.3 Program Verification


In model checking, the goal is to check whether a program satisfies a certain specification. 
When a model checker says that a program does not satisfy its specification, it typically pro- 
vides in addition a counterexample that demonstrates the failure of the specification. Such a 
counterexample can be quite useful in helping the programmer understand what went wrong 
and in correcting the program. In many cases, however, understanding the counterexample 
can be quite challenging. 

The problem is that counterexamples can be rather complicated objects. Specifications
typically talk about program behavior over time, saying things like “eventually $\varphi$ will happen”,
“$\psi$ will never happen”, and “property $\varphi'$ will hold at least until $\psi'$ does”. A counterexample
is a path of the program; that is, roughly speaking, a possibly infinite sequence of tuples
$(x_1, \ldots, x_n)$, where each tuple describes the values of the variables $X_1, \ldots, X_n$. Trying to
understand from the path why the program did not satisfy its specification may not be so
simple. As I now show, having a formal notion of causality can help in this regard.

Formally, we associate with each program a Kripke structure, a graph with nodes and
directed edges. Each node is labeled with a possible state of the program. If $X_1, \ldots, X_n$ are
the variables in the program, then a program state just describes the values of these variables.
For simplicity in this discussion, I assume that $X_1, \ldots, X_n$ are all binary variables, although
it is straightforward to extend the approach to non-binary variables (as long as they have finite
range). The edges in the Kripke structure describe possible transitions of the program. That
is, there is an edge from node $v_1$ to node $v_2$ if the program can make a transition from the
program state labeling $v_1$ to that labeling $v_2$. The programs we are interested in are typically
concurrent programs, which are best thought of as a collection of possibly communicating
programs, each of which is run by a different agent in the system. (We can think of the order
of moves as being determined exogenously.) This means that there may be several possible
transitions from a given node. For example, at a node where X and Y both have value 0, one
agent’s program may flip the value of X and another agent’s program may flip the value of Y.
If these are the only variables, then there would be a transition from the node labeled (0, 0) to
nodes labeled (1, 0) and (0, 1).
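The two-agent example can be written down as a small graph. The following Python sketch is only an illustration under simplifying assumptions that are not in the text: each agent owns exactly one binary variable, a transition consists of a single agent flipping its own variable, and the names `successors`, `kripke_edges`, and `is_path` are hypothetical.

```python
from itertools import product

def successors(state):
    """Program states reachable in one step when exactly one agent flips its own variable."""
    return {
        tuple(1 - value if i == agent else value for i, value in enumerate(state))
        for agent in range(len(state))
    }

def kripke_edges(num_vars):
    """Map each node (a tuple of binary values) to the set of nodes it has an edge to."""
    return {state: successors(state) for state in product((0, 1), repeat=num_vars)}

def is_path(edges, states):
    """Check that consecutive states are joined by a directed edge."""
    return all(nxt in edges[cur] for cur, nxt in zip(states, states[1:]))

edges = kripke_edges(2)                              # variables (X, Y)
print(sorted(edges[(0, 0)]))                         # the two possible moves from (0, 0)
print(is_path(edges, [(0, 0), (1, 0), (1, 1)]))      # X flips, then Y flips
print(is_path(edges, [(0, 0), (1, 1)]))              # both flipping at once is not an edge here
```

The nondeterminism at each node (several outgoing edges) is what models the exogenously determined order of moves.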


A path $\pi = (s_0, s_1, s_2, \ldots)$ in M is a sequence of states such that each state $s_i$ is associated
with a node in M and for all i, there is a directed edge from the node associated with $s_i$ to the
node associated with $s_{i+1}$. Although the use of the word “state” here is deliberately intended
to suggest “program state”, technically they are different. Although two distinct states $s_i$ and
$s_j$ on a path might be associated with the same node (and thus the same program state), it is
useful to think of them as being different so that we can talk about different occurrences of
the same program state on a path. That said, I often blur the distinction between state and
program state.


Kripke structures have frequently been used as models of modal logics, logics that extend
propositional or first-order logic by including modal operators such as operators for belief or
knowledge. For example, in a modal logic of knowledge, we can make statements like “Alice
knows that Bob knows that p is true”. Temporal logic, which includes operators like “eventually”,
“always”, and “until”, has been found to be a useful tool for specifying properties
of programs. For example, the temporal logic formula $\Box X$ says that X is always true (i.e.,
$\Box X$ is true of a path $\pi$ if X = 1 at every state in $\pi$). All the specifications mentioned above
(“eventually $\varphi$ will happen”, “$\psi$ will never happen”, and “property $\varphi'$ will hold at least until
$\psi'$ does”) can be described in temporal logic and evaluated on the paths of a Kripke structure.
That is, there is a relation $\models$ such that, given a Kripke structure M, a path $\pi$, and a temporal
logic formula $\varphi$, either $(M, \pi) \models \varphi$ or $(M, \pi) \models \neg\varphi$, but not both. The details of temporal
logic and the $\models$ relation are not necessary for the following discussion. All that we need is
that $(M, (s_0, s_1, s_2, \ldots)) \models \Box\varphi$, where $\varphi$ is a propositional formula, if $\varphi$ is true at each state
$s_i$.


A model checker may show that a program characterized by a Kripke structure M fails to
satisfy a particular specification $\varphi$ by returning (a compact description of) a possibly infinite
path $\pi$ in M that satisfies $\neg\varphi$. The path $\pi$ provides a counterexample to M satisfying $\varphi$. As
I said, although having such a counterexample is helpful, a programmer may want a deeper
understanding of why the program failed to satisfy the specification. This is where causality
comes in.


Beer, Ben-David, Chockler, Orni, and Trefler give a definition of causality in Kripke structures
in the spirit of the HP definition. Specifically, they define what it means for the fact that
X = x in the state $s_i$ on a path $\pi$ to be a cause of the temporal logic formula $\varphi$ not holding
in path $\pi$. (As an aside, rather than considering the value of the variable X in state s, for each
state s and variable X in the program, we could have a variable $X_s$, and just ask whether
$X_{s_i} = x$ is a cause of $\varphi$. Conceptually, it seems more useful to think of the same variable as
having different values in different states, rather than having many different variables $X_s$.) As
with the notion of causality in databases, the language for describing what can be caused is
richer than the language $\mathcal{L}(\mathcal{S})$, but the basic intuitions behind the definition of causality are
unchanged.

To formalize things, I need a few more definitions. Since we are dealing with binary
variables, in a temporal logic formula, I write X rather than X = 1 and $\neg X$ rather than
X = 0. For example, I write $X \land \neg Y$ rather than $X = 1 \land Y = 0$. Define the polarity
of an occurrence of a variable X in a formula $\varphi$ to be positive if X occurs in the scope
of an even number of negations and negative if X occurs in the scope of an odd number
of negations. Thus, for example, the first occurrence of X in a temporal formula such as
$\neg\Box(\neg X \land Y) \lor \neg X$ has positive polarity, the only occurrence of Y has negative polarity, and
the second occurrence of X has negative polarity. (The temporal operators are ignored when
computing the polarity of a variable.)
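The polarity computation is a simple recursion that counts enclosing negations. The sketch below assumes a hypothetical encoding of formulas as nested Python tuples (`("var", name)`, `("not", f)`, `("and", f, g)`, `("or", f, g)`, `("always", f)`, and so on); neither the encoding nor the function name `polarities` comes from the text.

```python
def polarities(formula, negations=0):
    """Yield (variable, polarity) for each variable occurrence, left to right."""
    op = formula[0]
    if op == "var":
        yield formula[1], "positive" if negations % 2 == 0 else "negative"
    elif op == "not":
        # Each negation on the way down flips the polarity.
        yield from polarities(formula[1], negations + 1)
    else:
        # Boolean connectives and temporal operators leave the negation count unchanged.
        for sub in formula[1:]:
            yield from polarities(sub, negations)

# The example formula from the text: not always (not X and Y), or not X
example = (
    "or",
    ("not", ("always", ("and", ("not", ("var", "X")), ("var", "Y")))),
    ("not", ("var", "X")),
)
print(list(polarities(example)))
```

On the example formula, this reports the first occurrence of X as positive, Y as negative, and the second occurrence of X as negative, matching the discussion above.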

A pair (s, X) consisting of a state s and a variable X is potentially helpful for $\varphi$ if either
there exists at least one occurrence of X in $\varphi$ that has positive polarity and X = 0 in s or
there exists at least one occurrence of X in $\varphi$ that has negative polarity and X = 1 in s. It
is not hard to show that if (s, X) is not potentially helpful for $\varphi$, then changing the value of
X in s cannot change the truth value of $\varphi$ from false to true in $\pi$. (This claim is stated more
formally below.)

Given distinct pairs $(s_1, X_1), \ldots, (s_k, X_k)$ and a path $\pi$, let $\pi[(s_1, X_1), \ldots, (s_k, X_k)]$ be
the path that is just like $\pi$ except that the value of $X_i$ in state $s_i$ is flipped, for $i = 1, \ldots, k$.
(We can think of $\pi[(s_1, X_1), \ldots, (s_k, X_k)]$ as the result of intervening on $\pi$ so as to flip the
value of the variable $X_i$ in state $s_i$, for $i = 1, \ldots, k$.) Thus, the previous observation says that
if (s, X) is not potentially helpful for $\varphi$ and $(M, \pi) \models \neg\varphi$, then $(M, \pi[(s, X)]) \models \neg\varphi$.

Suppose that $\pi$ is a counterexample to $\varphi$. A finite prefix $\rho$ of $\pi$ is said to be a counterexample
to $\varphi$ if $(M, \pi') \models \neg\varphi$ for all paths $\pi'$ that extend $\rho$. Note that even if $\pi$ is a counterexample
to $\varphi$, there may not be any finite prefix of $\pi$ that is a counterexample to $\varphi$. For example, if
$\varphi$ is the formula $\neg\Box(X = 0)$, which says that X is not always 0, so eventually X = 1, a
counterexample $\pi$ is a path such that X = 0 at every state on the path. However, for every
finite prefix $\rho$ of $\pi$, there is an extension $\pi'$ of $\rho$ that includes a state where X = 1, so $\rho$ is not
a counterexample.


A prefix $\rho$ of $\pi$ is said to be a minimal counterexample to $\varphi$ if it is a counterexample to
$\varphi$ and there is no shorter prefix that is a counterexample to $\varphi$. If no finite prefix of $\pi$ is a
counterexample to $\varphi$, then $\pi$ itself is a minimal counterexample to $\varphi$.

We are finally ready for the definition of causality. 


Definition 8.3.1 If $\pi$ is a counterexample to $\varphi$, then (s, X) (intuitively, the value of X in
state s) is a cause of the first failure of $\varphi$ on $\pi$ if


Co1. there is a prefix $\rho$ of $\pi$ (which could be $\pi$ itself) such that $\rho$ is a minimal counterexample
to $\varphi$ and s is a state on $\rho$; and


Co2. there exist distinct pairs $(s_1, X_1), \ldots, (s_k, X_k)$ potentially helpful for $\varphi$ such that
$(M, \rho[(s_1, X_1), \ldots, (s_k, X_k)]) \models \neg\varphi$ and $(M, \rho[(s, X), (s_1, X_1), \ldots, (s_k, X_k)]) \models \varphi$.


Condition Co2 in Definition 8.3.1 is essentially a combination of AC2(a) and AC2(b$^o$), if
we assume that the value of all variables is determined exogenously. Clearly the analogue
of AC2(a) holds: changing the value of X in state s flips $\neg\varphi$ from true to false, under the
contingency that the value of $X_i$ is flipped in state $s_i$, for $i = 1, \ldots, k$. Moreover, AC2(b$^o$)
holds, since leaving X at its value in state s while flipping $X_1, \ldots, X_k$ keeps $\varphi$ false.
To get an analogue of AC2(b$^u$), we need a further requirement:


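Co3. $(M, \rho[(s_{j_1}, X_{j_1}), \ldots, (s_{j_m}, X_{j_m})]) \models \neg\varphi$ if $\{X_{j_1}, \ldots, X_{j_m}\} \subseteq \{X_1, \ldots, X_k\}$.


Finally, to get an analogue of AC2(a$^m$), we need yet another condition:


Co4. $(M, \rho[(s, X), (s_{j_1}, X_{j_1}), \ldots, (s_{j_m}, X_{j_m})]) \models \neg\varphi$ if $\{X_{j_1}, \ldots, X_{j_m}\} \subset \{X_1, \ldots, X_k\}$.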


I use strict subset in Co4 since Co2 requires that $(M, \rho[(s, X), (s_1, X_1), \ldots, (s_k, X_k)]) \models \varphi$.
If Co2–4 hold, then the values of $X, X_1, \ldots, X_k$ in state s together form a cause of $\varphi$ according
to the modified HP definition, and the value of X is part of a cause of $\varphi$.

Condition Co1 in Definition 8.3.1 has no analogue in the HP definition of actual causality
(in part, because there is no notion of time in the basic HP framework). It was added here 
because knowing about the first failure of y seems to be of particular interest to programmers. 
Also, focusing on one failure reduces the set of causes, which makes it easier for programmers 
to understand an explanation. Of course, the definition would make perfect sense without this 
clause. 


Example 8.3.2 Suppose that (the program state labeling) node v satisfies X = 1 and node
v′ satisfies X = 0. Assume that v and v′ are completely connected in the Kripke structure:
there is an edge from v to itself, from v to v′, from v′ to itself, and from v′ to v. Consider the
path $\pi = (s_0, s_1, s_2, \ldots)$, where all states on the path are associated with v except $s_2$, which
is associated with v′. Clearly $(M, \pi) \models \neg\Box X$. Indeed, if $\rho = (s_0, s_1, s_2)$, then $\Box X$ already
fails in $\rho$. Note that $(s_2, X)$ is potentially helpful for $\Box X$ (since X has positive polarity in
$\Box X$ and X = 0 in $s_2$). It is easy to check that $(s_2, X)$ is a cause of the first failure of $\Box X$
in $\pi$.

Now consider the formula $\neg\Box X$. The path $\pi' = (s'_0, s'_1, s'_2, \ldots)$ where each state $s'_j$ is
associated with v is a counterexample to $\neg\Box X$, but no finite prefix of $\pi'$ is a counterexample;
it is always possible to extend a finite prefix so as to make $\neg\Box X$ true (by making X = 0 at
some state in the extension). The value of X in each of the states on $\pi'$ is a cause of $\neg\Box X$
failing. ■
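The first half of Example 8.3.2 can be checked mechanically. The Python sketch below brute-forces Co1 and Co2 for the single formula "always X" on a finite path; it is an illustration under stated assumptions, not the polynomial-time algorithm discussed next. The assumptions are: states are dictionaries of binary values indexed by position, "always X" is read on a finite prefix as "X = 1 in every state of the prefix", the infinite path is truncated to four states, and all helper names are hypothetical.

```python
from itertools import combinations

def holds_always(prefix, var):
    """Finite-prefix reading of 'always var': var = 1 in every state of the prefix."""
    return all(state[var] == 1 for state in prefix)

def flip(prefix, pairs):
    """Copy of prefix with the value of variable x flipped in state index i, for each (i, x)."""
    new = [dict(state) for state in prefix]
    for i, x in pairs:
        new[i][x] = 1 - new[i][x]
    return new

def minimal_counterexample(path, var):
    """Co1: the shortest prefix on which 'always var' already fails, or None."""
    for n in range(1, len(path) + 1):
        if not holds_always(path[:n], var):
            return path[:n]
    return None

def is_cause(path, var, pair):
    """Brute-force check of Co1 and Co2 for pair = (state index, variable)."""
    rho = minimal_counterexample(path, var)
    if rho is None or pair[0] >= len(rho):
        return False
    # var occurs positively in 'always var', so (i, var) is potentially helpful iff var = 0 in state i.
    helpful = [(i, var) for i, s in enumerate(rho) if s[var] == 0 and (i, var) != pair]
    for k in range(len(helpful) + 1):
        for others in combinations(helpful, k):
            stays_false = not holds_always(flip(rho, others), var)
            becomes_true = holds_always(flip(rho, (pair,) + others), var)
            if stays_false and becomes_true:
                return True
    return False

v, v_prime = {"X": 1}, {"X": 0}
path = [v, v, v_prime, v]            # finite stand-in for (s0, s1, s2, ...)
print(is_cause(path, "X", (2, "X")))  # the value of X in s2
print(is_cause(path, "X", (0, "X")))  # the value of X in s0
```

Under these assumptions the check agrees with the example: the value of X in $s_2$ is a cause of the first failure, while the value of X in $s_0$ is not, since flipping it cannot make "always X" hold on the prefix.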


Beer, Ben-David, Chockler, Orni, and Trefler built a tool that implemented these ideas; they
report that programmers found it quite helpful. 

Just as in the database case, the complexity of computing causality becomes significant in
this application. In general, the complexity of computing causality in counterexamples is
NP-complete. However, given a finite path $\rho$ and a specification $\varphi$ that fails in all paths extending
$\rho$, there is an algorithm that runs in time polynomial in the length of $\rho$ and $\varphi$ (viewed as a
string of symbols) that produces a superset of the causes of $\varphi$ failing in $\rho$. Moreover, in many
cases of practical interest, the set produced by the algorithm is precisely the set of causes.
This algorithm is actually the one implemented in the tool.

Causality also turns out to be useful even if a program does satisfy its specification. In that
case, a model checker typically provides no further information. The implicit assumption was
that if the model checker says that a program satisfies its specification, then programmers are
happy to leave well enough alone; they feel no need to further analyze the program. However,
there has been growing awareness that further analysis may be necessary even if a model
checker reports that a program satisfies its specification. The concern is that the program may
satisfy its specification only because of an error in the specification itself (the specification
may not cover an important case). One way to minimize this concern is to try to understand
what causes the program to satisfy its specification. And one useful way of formalizing this 
turns out to be to examine the degree of responsibility that the value of variable X in state 
s (i.e., a pair (s, X), in the notation of the earlier discussion) has for the program satisfying 
the specification. If there are states s such that (s, X) has a low degree of responsibility for 
a specification y for all variables X, this suggests that s is redundant. Similarly, if there is 
a variable X such that (s,X) has a low degree of responsibility for y for all states s, this 
suggests that X is redundant. In either case, we have a hint that there may be a problem with 
the specification: The programmer felt that state s (or variable X) was important, but it is not 
carrying its weight in terms of satisfying the specification. Understanding why this is the case 
may uncover a problem. 

Again, while the general problem of computing degree of responsibility in this context is 
hard, there are tractable special cases of interest. It is necessary to identify these tractable 
cases and argue that they are relevant in practice for this approach to be considered worth- 
while. 


8.4 Last Words 


These examples have a number of features that are worth stressing. 


• The language used is determined by the problem in a straightforward way, as is the
causal model. That means that many of the subtle modeling issues discussed in Chap- 
ter 4 do not apply. Moreover, adding variables so as to convert a cause according to 
the updated HP definition to a cause according to the original HP definition, as in Sec- 
tion 4.3, seems less appropriate. There is also a clear notion of normality when it comes 
to accountability (following the prescribed program) and model checking (satisfying 
the specification). In the database context, normality plays a smaller role, although, as 
I noted earlier, the restriction on the witnesses considered can be viewed as saying that 
it is abnormal to add tuples to the database. More generally, it may make sense to view 
removing one tuple from the database as less normal than removing a different tuple. 


• In Chapter 5, I raised computational concerns on the grounds that if causality is hard to
compute and structural-equations models are hard to represent, then it seems implausi- 
ble that this is how people evaluate causality. The examples that people work with are 
typically quite small. In the applications discussed in the chapter, the causal models 
may have thousands of variables (or even millions, in the case of databases). Now com- 
putational concerns are a major issue. Programs provide a compact way to represent 
many structural equations, so they are useful not only conceptually, but to deal with 
concerns regarding compact representations. And, as I have already observed, finding 
special cases that are of interest and easy to compute becomes more important.


• It is certainly important in all these examples that responses to queries regarding causality
agree with people’s intuitions. Fortunately, for all the examples considered in the
literature, they do. The philosophy literature on causality has focused on (arguably, 
obsessed over) finding counterexamples to various proposed definitions; a definition is 
judged on how well it handles various examples. The focus of the applications papers 
is quite different. Finding some examples where the definitions do not give intuitive 
answers in these applications is not necessarily a major problem; the definitions can 
still be quite useful. In the database and program verification applications, utility is the 
major criterion for judging the approach. Moreover, even if a database system provides 
unintuitive answers, there is always the possibility of posing further queries to help 
clarify what is going on. Of course, in the case of accountability, if legal judgments 
are going to be based on what the system says, then we must be careful. But in many 
cases, having a transparent, well-understood definition that gives predictable answers 
that mainly match intuitions is far better than the current approach, where we use poorly 
defined notions of causality and get inconsistent treatment in courts. 


The applications in this chapter illustrate just how useful a good account of causality can 
be. Although the HP definition needed to be tweaked somewhat in these applications (both be- 
cause of the richer languages involved and to decrease the complexity of determining causal- 
ity), these examples, as well as the many other examples in the book, make me confident 
that the HP definition, augmented by a normality ordering if necessary, provides a powerful 
basis for a useful account of causality. Clearly more work can and should be done on refin- 
ing and extending the definitions of causality, responsibility, and explanation. But I suspect 
that looking at the applications will suggest further research questions quite different from the 
traditional questions that researchers have focused on thus far. I fully expect that the coming 
years will give us further applications and a deeper understanding of the issues involved. 


Notes 


Lombrozo [2010] provides further evidence regarding the plethora of factors that affect 
causality ascriptions. 

Feigenbaum, Jaggard, and Wright [2011] point out the role of causality in accountabil- 
ity ascriptions and suggest using the HP definition. Datta et al. [2015] explore the role of 
causality in accountability in greater detail. Although their approach to causality involves 
counterfactuals, there are some significant differences from the HP definition. Among other 
things, they use sufficient causality rather than what I have been calling “actual causality”. 

Much of the material on the use of causality in databases is taken from the work of Meliou, 
Gatterbauer, Halpern, Koch, Moore, and Suciu [2010]. The complexity results for queries 
mentioned in the discussion (as well as other related results) are proved by Meliou, Gatter- 
bauer, Moore, and Suciu [2010]. 

As I said in Section 8.3, Beer et al. [2012] examined the use of causality in explaining
counterexamples to the specification of programs. The causality computation was implemented
in a counterexample explanation tool for the IBM hardware model checker RuleBase
[Beer et al. 2012]. Programmers found the tool tremendously helpful. In fact, when it was
temporarily disabled between releases, users called and complained about its absence
[Chockler 2015].

Chockler, Halpern, and Kupferman [2008] considered the use of causality and responsi- 
bility to check whether specifications were appropriate, or if the programmer was trying to 
satisfy a specification different from the one that the model checker was checking. They 
provided a number of tractable cases of the problem. 

Chockler, Grumberg, and Yadgar [2008] consider another application of responsibility (in 
the spirit of the HP definition) to model checking: refinement. Because programs can be so 
large and complicated, model checking can take a long time. One way to reduce the runtime 
is to start with a coarse representation of the program, where one event represents a family of 
events. Model checking a coarse representation is much faster than model checking a refined 
representation, but it may not produce a definitive answer. The representation is then refined 
until a definitive answer is obtained. Chockler, Grumberg, and Yadgar [2008] suggested that 
the way that the refinement process works could be guided by responsibility considerations. 
Roughly speaking, the idea is to refine events that are most likely to be responsible for the 
outcome. Again, they implemented their ideas in a model-checking tool and showed that it 
performed very well in practice. 

Groce [2005] also used a notion of causality based on counterfactuals to explain errors in 
programs, although his approach was not based on the HP definitions. 